* xfs: fix CIL push hang in for-next tree @ 2021-06-15 6:46 Dave Chinner 2021-06-15 6:46 ` [PATCH 1/2] xfs: add iclog state trace events Dave Chinner ` (2 more replies) 0 siblings, 3 replies; 14+ messages in thread From: Dave Chinner @ 2021-06-15 6:46 UTC (permalink / raw) To: linux-xfs Hi folks, This is the first fix for the problems Brian has reported from generic/019. This has fixed the hang, but the other log recovery problem he reported is still present (seen once with these patches in place). I've tested these out to a couple of hundred cycles of continual looping generic/019 before the systems fall over with a perag reference count underrun at unmount after a shutdown. I'm pretty sure the hang is fixed, as it would manifest within 10-20 cycles without this patch. The first patch is the iclogbuf state tracing I used to capture the iclogbuf wrapping state. The second patch is the fix. Cheers, Dave. ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 1/2] xfs: add iclog state trace events 2021-06-15 6:46 xfs: fix CIL push hang in for-next tree Dave Chinner @ 2021-06-15 6:46 ` Dave Chinner 2021-06-15 15:26 ` Darrick J. Wong 2021-06-15 15:37 ` Brian Foster 2021-06-15 6:46 ` [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner 2021-06-15 17:57 ` xfs: fix CIL push hang in for-next tree Darrick J. Wong 2 siblings, 2 replies; 14+ messages in thread From: Dave Chinner @ 2021-06-15 6:46 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> For the DEBUGS! Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_log.c | 18 ++++++++++++++++ fs/xfs/xfs_trace.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 70 insertions(+) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index e921b554b683..54fd6a695bb5 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -524,6 +524,7 @@ __xlog_state_release_iclog( iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn); xlog_verify_tail_lsn(log, iclog, tail_lsn); /* cycle incremented when incrementing curr_block */ + trace_xlog_iclog_syncing(iclog, _RET_IP_); return true; } @@ -543,6 +544,7 @@ xlog_state_release_iclog( { lockdep_assert_held(&log->l_icloglock); + trace_xlog_iclog_release(iclog, _RET_IP_); if (iclog->ic_state == XLOG_STATE_IOERROR) return -EIO; @@ -804,6 +806,7 @@ xlog_wait_on_iclog( { struct xlog *log = iclog->ic_log; + trace_xlog_iclog_wait_on(iclog, _RET_IP_); if (!XLOG_FORCED_SHUTDOWN(log) && iclog->ic_state != XLOG_STATE_ACTIVE && iclog->ic_state != XLOG_STATE_DIRTY) { @@ -1804,6 +1807,7 @@ xlog_write_iclog( unsigned int count) { ASSERT(bno < log->l_logBBsize); + trace_xlog_iclog_write(iclog, _RET_IP_); /* * We lock the iclogbufs here so that we can serialise against I/O @@ -1950,6 +1954,7 @@ xlog_sync( unsigned int size; ASSERT(atomic_read(&iclog->ic_refcnt) == 0); + trace_xlog_iclog_sync(iclog, _RET_IP_); count = xlog_calc_iclog_size(log, iclog, &roundoff); @@ -2488,6 +2493,7 @@ xlog_state_activate_iclog( int *iclogs_changed) { ASSERT(list_empty_careful(&iclog->ic_callbacks)); + trace_xlog_iclog_activate(iclog, _RET_IP_); /* * If the number of ops in this iclog indicate it just contains the @@ -2577,6 +2583,8 @@ xlog_state_clean_iclog( { int iclogs_changed = 0; + trace_xlog_iclog_clean(dirty_iclog, _RET_IP_); + dirty_iclog->ic_state = XLOG_STATE_DIRTY; xlog_state_activate_iclogs(log, &iclogs_changed); @@ -2636,6 +2644,7 @@ xlog_state_set_callback( struct xlog_in_core *iclog, xfs_lsn_t header_lsn) { + trace_xlog_iclog_callback(iclog, _RET_IP_); iclog->ic_state = XLOG_STATE_CALLBACK; ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), @@ -2717,6 +2726,7 @@ xlog_state_do_iclog_callbacks( __releases(&log->l_icloglock) __acquires(&log->l_icloglock) { + trace_xlog_iclog_callbacks_start(iclog, _RET_IP_); spin_unlock(&log->l_icloglock); spin_lock(&iclog->ic_callback_lock); while (!list_empty(&iclog->ic_callbacks)) { @@ -2736,6 +2746,7 @@ xlog_state_do_iclog_callbacks( */ spin_lock(&log->l_icloglock); spin_unlock(&iclog->ic_callback_lock); + trace_xlog_iclog_callbacks_done(iclog, _RET_IP_); } STATIC void @@ -2827,6 +2838,7 @@ xlog_state_done_syncing( spin_lock(&log->l_icloglock); ASSERT(atomic_read(&iclog->ic_refcnt) == 0); + trace_xlog_iclog_sync_done(iclog, _RET_IP_); /* * If we got an error, either on the first buffer, or in the case of @@ -2899,6 +2911,8 @@ xlog_state_get_iclog_space( atomic_inc(&iclog->ic_refcnt); /* prevents sync */ log_offset = iclog->ic_offset; + trace_xlog_iclog_get_space(iclog, _RET_IP_); + /* On the 1st write to an iclog, figure out lsn. This works * if iclogs marked XLOG_STATE_WANT_SYNC always write out what they are * committing to. If the offset is set, that's how many blocks @@ -3056,6 +3070,7 @@ xlog_state_switch_iclogs( { ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE); assert_spin_locked(&log->l_icloglock); + trace_xlog_iclog_switch(iclog, _RET_IP_); if (!eventual_size) eventual_size = iclog->ic_offset; @@ -3138,6 +3153,8 @@ xfs_log_force( if (iclog->ic_state == XLOG_STATE_IOERROR) goto out_error; + trace_xlog_iclog_force(iclog, _RET_IP_); + if (iclog->ic_state == XLOG_STATE_DIRTY || (iclog->ic_state == XLOG_STATE_ACTIVE && atomic_read(&iclog->ic_refcnt) == 0 && iclog->ic_offset == 0)) { @@ -3225,6 +3242,7 @@ xlog_force_lsn( goto out_error; while (be64_to_cpu(iclog->ic_header.h_lsn) != lsn) { + trace_xlog_iclog_force_lsn(iclog, _RET_IP_); iclog = iclog->ic_next; if (iclog == log->l_iclog) goto out_unlock; diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 71dca776c110..79ee973ba1ed 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -24,6 +24,7 @@ struct xlog_ticket; struct xlog_recover; struct xlog_recover_item; struct xlog_rec_header; +struct xlog_in_core; struct xfs_buf_log_format; struct xfs_inode_log_format; struct xfs_bmbt_irec; @@ -3927,6 +3928,57 @@ DEFINE_EVENT(xfs_icwalk_class, name, \ DEFINE_ICWALK_EVENT(xfs_ioc_free_eofblocks); DEFINE_ICWALK_EVENT(xfs_blockgc_free_space); +DECLARE_EVENT_CLASS(xlog_iclog_class, + TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip), + TP_ARGS(iclog, caller_ip), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(uint32_t, state) + __field(int32_t, refcount) + __field(uint32_t, offset) + __field(unsigned long long, lsn) + __field(unsigned long, caller_ip) + ), + TP_fast_assign( + __entry->dev = iclog->ic_log->l_mp->m_super->s_dev; + __entry->state = iclog->ic_state; + __entry->refcount = atomic_read(&iclog->ic_refcnt); + __entry->offset = iclog->ic_offset; + __entry->lsn = be64_to_cpu(iclog->ic_header.h_lsn); + __entry->caller_ip = caller_ip; + ), + TP_printk("dev %d:%d state 0x%x refcnt %d offset %u lsn 0x%llx caller %pS", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->state, + __entry->refcount, + __entry->offset, + __entry->lsn, + (char *)__entry->caller_ip) + +); + +#define DEFINE_ICLOG_EVENT(name) \ +DEFINE_EVENT(xlog_iclog_class, name, \ + TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip), \ + TP_ARGS(iclog, caller_ip)) + +DEFINE_ICLOG_EVENT(xlog_iclog_activate); +DEFINE_ICLOG_EVENT(xlog_iclog_clean); +DEFINE_ICLOG_EVENT(xlog_iclog_callback); +DEFINE_ICLOG_EVENT(xlog_iclog_callbacks_start); +DEFINE_ICLOG_EVENT(xlog_iclog_callbacks_done); +DEFINE_ICLOG_EVENT(xlog_iclog_force); +DEFINE_ICLOG_EVENT(xlog_iclog_force_lsn); +DEFINE_ICLOG_EVENT(xlog_iclog_get_space); +DEFINE_ICLOG_EVENT(xlog_iclog_release); +DEFINE_ICLOG_EVENT(xlog_iclog_switch); +DEFINE_ICLOG_EVENT(xlog_iclog_sync); +DEFINE_ICLOG_EVENT(xlog_iclog_syncing); +DEFINE_ICLOG_EVENT(xlog_iclog_sync_done); +DEFINE_ICLOG_EVENT(xlog_iclog_want_sync); +DEFINE_ICLOG_EVENT(xlog_iclog_wait_on); +DEFINE_ICLOG_EVENT(xlog_iclog_write); + #endif /* _TRACE_XFS_H */ #undef TRACE_INCLUDE_PATH -- 2.31.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] xfs: add iclog state trace events 2021-06-15 6:46 ` [PATCH 1/2] xfs: add iclog state trace events Dave Chinner @ 2021-06-15 15:26 ` Darrick J. Wong 2021-06-15 15:37 ` Brian Foster 1 sibling, 0 replies; 14+ messages in thread From: Darrick J. Wong @ 2021-06-15 15:26 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Tue, Jun 15, 2021 at 04:46:57PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > For the DEBUGS! > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log.c | 18 ++++++++++++++++ > fs/xfs/xfs_trace.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 70 insertions(+) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index e921b554b683..54fd6a695bb5 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -524,6 +524,7 @@ __xlog_state_release_iclog( > iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn); > xlog_verify_tail_lsn(log, iclog, tail_lsn); > /* cycle incremented when incrementing curr_block */ > + trace_xlog_iclog_syncing(iclog, _RET_IP_); > return true; > } > > @@ -543,6 +544,7 @@ xlog_state_release_iclog( > { > lockdep_assert_held(&log->l_icloglock); > > + trace_xlog_iclog_release(iclog, _RET_IP_); > if (iclog->ic_state == XLOG_STATE_IOERROR) > return -EIO; > > @@ -804,6 +806,7 @@ xlog_wait_on_iclog( > { > struct xlog *log = iclog->ic_log; > > + trace_xlog_iclog_wait_on(iclog, _RET_IP_); > if (!XLOG_FORCED_SHUTDOWN(log) && > iclog->ic_state != XLOG_STATE_ACTIVE && > iclog->ic_state != XLOG_STATE_DIRTY) { > @@ -1804,6 +1807,7 @@ xlog_write_iclog( > unsigned int count) > { > ASSERT(bno < log->l_logBBsize); > + trace_xlog_iclog_write(iclog, _RET_IP_); > > /* > * We lock the iclogbufs here so that we can serialise against I/O > @@ -1950,6 +1954,7 @@ xlog_sync( > unsigned int size; > > ASSERT(atomic_read(&iclog->ic_refcnt) == 0); > + trace_xlog_iclog_sync(iclog, _RET_IP_); > > count = xlog_calc_iclog_size(log, iclog, &roundoff); > > @@ -2488,6 +2493,7 @@ xlog_state_activate_iclog( > int *iclogs_changed) > { > ASSERT(list_empty_careful(&iclog->ic_callbacks)); > + trace_xlog_iclog_activate(iclog, _RET_IP_); > > /* > * If the number of ops in this iclog indicate it just contains the > @@ -2577,6 +2583,8 @@ xlog_state_clean_iclog( > { > int iclogs_changed = 0; > > + trace_xlog_iclog_clean(dirty_iclog, _RET_IP_); > + > dirty_iclog->ic_state = XLOG_STATE_DIRTY; > > xlog_state_activate_iclogs(log, &iclogs_changed); > @@ -2636,6 +2644,7 @@ xlog_state_set_callback( > struct xlog_in_core *iclog, > xfs_lsn_t header_lsn) > { > + trace_xlog_iclog_callback(iclog, _RET_IP_); > iclog->ic_state = XLOG_STATE_CALLBACK; > > ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), > @@ -2717,6 +2726,7 @@ xlog_state_do_iclog_callbacks( > __releases(&log->l_icloglock) > __acquires(&log->l_icloglock) > { > + trace_xlog_iclog_callbacks_start(iclog, _RET_IP_); > spin_unlock(&log->l_icloglock); > spin_lock(&iclog->ic_callback_lock); > while (!list_empty(&iclog->ic_callbacks)) { > @@ -2736,6 +2746,7 @@ xlog_state_do_iclog_callbacks( > */ > spin_lock(&log->l_icloglock); > spin_unlock(&iclog->ic_callback_lock); > + trace_xlog_iclog_callbacks_done(iclog, _RET_IP_); > } > > STATIC void > @@ -2827,6 +2838,7 @@ xlog_state_done_syncing( > > spin_lock(&log->l_icloglock); > ASSERT(atomic_read(&iclog->ic_refcnt) == 0); > + trace_xlog_iclog_sync_done(iclog, _RET_IP_); > > /* > * If we got an error, either on the first buffer, or in the case of > @@ -2899,6 +2911,8 @@ xlog_state_get_iclog_space( > atomic_inc(&iclog->ic_refcnt); /* prevents sync */ > log_offset = iclog->ic_offset; > > + trace_xlog_iclog_get_space(iclog, _RET_IP_); > + > /* On the 1st write to an iclog, figure out lsn. This works > * if iclogs marked XLOG_STATE_WANT_SYNC always write out what they are > * committing to. If the offset is set, that's how many blocks > @@ -3056,6 +3070,7 @@ xlog_state_switch_iclogs( > { > ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE); > assert_spin_locked(&log->l_icloglock); > + trace_xlog_iclog_switch(iclog, _RET_IP_); > > if (!eventual_size) > eventual_size = iclog->ic_offset; > @@ -3138,6 +3153,8 @@ xfs_log_force( > if (iclog->ic_state == XLOG_STATE_IOERROR) > goto out_error; > > + trace_xlog_iclog_force(iclog, _RET_IP_); > + > if (iclog->ic_state == XLOG_STATE_DIRTY || > (iclog->ic_state == XLOG_STATE_ACTIVE && > atomic_read(&iclog->ic_refcnt) == 0 && iclog->ic_offset == 0)) { > @@ -3225,6 +3242,7 @@ xlog_force_lsn( > goto out_error; > > while (be64_to_cpu(iclog->ic_header.h_lsn) != lsn) { > + trace_xlog_iclog_force_lsn(iclog, _RET_IP_); > iclog = iclog->ic_next; > if (iclog == log->l_iclog) > goto out_unlock; > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index 71dca776c110..79ee973ba1ed 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -24,6 +24,7 @@ struct xlog_ticket; > struct xlog_recover; > struct xlog_recover_item; > struct xlog_rec_header; > +struct xlog_in_core; > struct xfs_buf_log_format; > struct xfs_inode_log_format; > struct xfs_bmbt_irec; > @@ -3927,6 +3928,57 @@ DEFINE_EVENT(xfs_icwalk_class, name, \ > DEFINE_ICWALK_EVENT(xfs_ioc_free_eofblocks); > DEFINE_ICWALK_EVENT(xfs_blockgc_free_space); > > +DECLARE_EVENT_CLASS(xlog_iclog_class, > + TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip), > + TP_ARGS(iclog, caller_ip), > + TP_STRUCT__entry( > + __field(dev_t, dev) > + __field(uint32_t, state) > + __field(int32_t, refcount) > + __field(uint32_t, offset) > + __field(unsigned long long, lsn) > + __field(unsigned long, caller_ip) > + ), > + TP_fast_assign( > + __entry->dev = iclog->ic_log->l_mp->m_super->s_dev; > + __entry->state = iclog->ic_state; > + __entry->refcount = atomic_read(&iclog->ic_refcnt); > + __entry->offset = iclog->ic_offset; > + __entry->lsn = be64_to_cpu(iclog->ic_header.h_lsn); > + __entry->caller_ip = caller_ip; > + ), > + TP_printk("dev %d:%d state 0x%x refcnt %d offset %u lsn 0x%llx caller %pS", > + MAJOR(__entry->dev), MINOR(__entry->dev), > + __entry->state, Any chance you want to embed XLOG_STATE_* decoding instead of dumping the raw value here? Otherwise, looks fine to me, Reviewed-by: Darrick J. Wong <djwong@kernel.org> --D > + __entry->refcount, > + __entry->offset, > + __entry->lsn, > + (char *)__entry->caller_ip) > + > +); > + > +#define DEFINE_ICLOG_EVENT(name) \ > +DEFINE_EVENT(xlog_iclog_class, name, \ > + TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip), \ > + TP_ARGS(iclog, caller_ip)) > + > +DEFINE_ICLOG_EVENT(xlog_iclog_activate); > +DEFINE_ICLOG_EVENT(xlog_iclog_clean); > +DEFINE_ICLOG_EVENT(xlog_iclog_callback); > +DEFINE_ICLOG_EVENT(xlog_iclog_callbacks_start); > +DEFINE_ICLOG_EVENT(xlog_iclog_callbacks_done); > +DEFINE_ICLOG_EVENT(xlog_iclog_force); > +DEFINE_ICLOG_EVENT(xlog_iclog_force_lsn); > +DEFINE_ICLOG_EVENT(xlog_iclog_get_space); > +DEFINE_ICLOG_EVENT(xlog_iclog_release); > +DEFINE_ICLOG_EVENT(xlog_iclog_switch); > +DEFINE_ICLOG_EVENT(xlog_iclog_sync); > +DEFINE_ICLOG_EVENT(xlog_iclog_syncing); > +DEFINE_ICLOG_EVENT(xlog_iclog_sync_done); > +DEFINE_ICLOG_EVENT(xlog_iclog_want_sync); > +DEFINE_ICLOG_EVENT(xlog_iclog_wait_on); > +DEFINE_ICLOG_EVENT(xlog_iclog_write); > + > #endif /* _TRACE_XFS_H */ > > #undef TRACE_INCLUDE_PATH > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] xfs: add iclog state trace events 2021-06-15 6:46 ` [PATCH 1/2] xfs: add iclog state trace events Dave Chinner 2021-06-15 15:26 ` Darrick J. Wong @ 2021-06-15 15:37 ` Brian Foster 2021-06-15 22:47 ` Dave Chinner 1 sibling, 1 reply; 14+ messages in thread From: Brian Foster @ 2021-06-15 15:37 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Tue, Jun 15, 2021 at 04:46:57PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > For the DEBUGS! > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log.c | 18 ++++++++++++++++ > fs/xfs/xfs_trace.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 70 insertions(+) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index e921b554b683..54fd6a695bb5 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -524,6 +524,7 @@ __xlog_state_release_iclog( > iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn); > xlog_verify_tail_lsn(log, iclog, tail_lsn); > /* cycle incremented when incrementing curr_block */ > + trace_xlog_iclog_syncing(iclog, _RET_IP_); > return true; > } > > @@ -543,6 +544,7 @@ xlog_state_release_iclog( > { > lockdep_assert_held(&log->l_icloglock); > > + trace_xlog_iclog_release(iclog, _RET_IP_); > if (iclog->ic_state == XLOG_STATE_IOERROR) > return -EIO; > > @@ -804,6 +806,7 @@ xlog_wait_on_iclog( > { > struct xlog *log = iclog->ic_log; > > + trace_xlog_iclog_wait_on(iclog, _RET_IP_); Seems like this might be more informative if we actually wait. > if (!XLOG_FORCED_SHUTDOWN(log) && > iclog->ic_state != XLOG_STATE_ACTIVE && > iclog->ic_state != XLOG_STATE_DIRTY) { > @@ -1804,6 +1807,7 @@ xlog_write_iclog( > unsigned int count) > { > ASSERT(bno < log->l_logBBsize); > + trace_xlog_iclog_write(iclog, _RET_IP_); > > /* > * We lock the iclogbufs here so that we can serialise against I/O > @@ -1950,6 +1954,7 @@ xlog_sync( > unsigned int size; > > ASSERT(atomic_read(&iclog->ic_refcnt) == 0); > + trace_xlog_iclog_sync(iclog, _RET_IP_); > > count = xlog_calc_iclog_size(log, iclog, &roundoff); > > @@ -2488,6 +2493,7 @@ xlog_state_activate_iclog( > int *iclogs_changed) > { > ASSERT(list_empty_careful(&iclog->ic_callbacks)); > + trace_xlog_iclog_activate(iclog, _RET_IP_); > I wonder a bit whether these state change oriented tracepoints could be served by a single trace_ixlog_iclog_state() or some such, since the tracepoint already prints the state. That wouldn't show the before -> state in each tracepoint, but that should reflect in the full log. > /* > * If the number of ops in this iclog indicate it just contains the > @@ -2577,6 +2583,8 @@ xlog_state_clean_iclog( > { > int iclogs_changed = 0; > > + trace_xlog_iclog_clean(dirty_iclog, _RET_IP_); > + > dirty_iclog->ic_state = XLOG_STATE_DIRTY; > > xlog_state_activate_iclogs(log, &iclogs_changed); > @@ -2636,6 +2644,7 @@ xlog_state_set_callback( > struct xlog_in_core *iclog, > xfs_lsn_t header_lsn) > { > + trace_xlog_iclog_callback(iclog, _RET_IP_); > iclog->ic_state = XLOG_STATE_CALLBACK; > > ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn), > @@ -2717,6 +2726,7 @@ xlog_state_do_iclog_callbacks( > __releases(&log->l_icloglock) > __acquires(&log->l_icloglock) > { > + trace_xlog_iclog_callbacks_start(iclog, _RET_IP_); > spin_unlock(&log->l_icloglock); > spin_lock(&iclog->ic_callback_lock); > while (!list_empty(&iclog->ic_callbacks)) { > @@ -2736,6 +2746,7 @@ xlog_state_do_iclog_callbacks( > */ > spin_lock(&log->l_icloglock); > spin_unlock(&iclog->ic_callback_lock); > + trace_xlog_iclog_callbacks_done(iclog, _RET_IP_); > } > > STATIC void > @@ -2827,6 +2838,7 @@ xlog_state_done_syncing( > > spin_lock(&log->l_icloglock); > ASSERT(atomic_read(&iclog->ic_refcnt) == 0); > + trace_xlog_iclog_sync_done(iclog, _RET_IP_); > > /* > * If we got an error, either on the first buffer, or in the case of > @@ -2899,6 +2911,8 @@ xlog_state_get_iclog_space( > atomic_inc(&iclog->ic_refcnt); /* prevents sync */ > log_offset = iclog->ic_offset; > > + trace_xlog_iclog_get_space(iclog, _RET_IP_); > + > /* On the 1st write to an iclog, figure out lsn. This works > * if iclogs marked XLOG_STATE_WANT_SYNC always write out what they are > * committing to. If the offset is set, that's how many blocks > @@ -3056,6 +3070,7 @@ xlog_state_switch_iclogs( > { > ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE); > assert_spin_locked(&log->l_icloglock); > + trace_xlog_iclog_switch(iclog, _RET_IP_); > > if (!eventual_size) > eventual_size = iclog->ic_offset; > @@ -3138,6 +3153,8 @@ xfs_log_force( > if (iclog->ic_state == XLOG_STATE_IOERROR) > goto out_error; > > + trace_xlog_iclog_force(iclog, _RET_IP_); > + We have a log force tracepoint a few lines up. Do we need both? > if (iclog->ic_state == XLOG_STATE_DIRTY || > (iclog->ic_state == XLOG_STATE_ACTIVE && > atomic_read(&iclog->ic_refcnt) == 0 && iclog->ic_offset == 0)) { > @@ -3225,6 +3242,7 @@ xlog_force_lsn( > goto out_error; > > while (be64_to_cpu(iclog->ic_header.h_lsn) != lsn) { > + trace_xlog_iclog_force_lsn(iclog, _RET_IP_); > iclog = iclog->ic_next; > if (iclog == log->l_iclog) > goto out_unlock; > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index 71dca776c110..79ee973ba1ed 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -24,6 +24,7 @@ struct xlog_ticket; > struct xlog_recover; > struct xlog_recover_item; > struct xlog_rec_header; > +struct xlog_in_core; > struct xfs_buf_log_format; > struct xfs_inode_log_format; > struct xfs_bmbt_irec; > @@ -3927,6 +3928,57 @@ DEFINE_EVENT(xfs_icwalk_class, name, \ > DEFINE_ICWALK_EVENT(xfs_ioc_free_eofblocks); > DEFINE_ICWALK_EVENT(xfs_blockgc_free_space); > > +DECLARE_EVENT_CLASS(xlog_iclog_class, > + TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip), > + TP_ARGS(iclog, caller_ip), > + TP_STRUCT__entry( > + __field(dev_t, dev) > + __field(uint32_t, state) > + __field(int32_t, refcount) > + __field(uint32_t, offset) > + __field(unsigned long long, lsn) > + __field(unsigned long, caller_ip) > + ), > + TP_fast_assign( > + __entry->dev = iclog->ic_log->l_mp->m_super->s_dev; > + __entry->state = iclog->ic_state; > + __entry->refcount = atomic_read(&iclog->ic_refcnt); > + __entry->offset = iclog->ic_offset; > + __entry->lsn = be64_to_cpu(iclog->ic_header.h_lsn); > + __entry->caller_ip = caller_ip; > + ), > + TP_printk("dev %d:%d state 0x%x refcnt %d offset %u lsn 0x%llx caller %pS", > + MAJOR(__entry->dev), MINOR(__entry->dev), > + __entry->state, It might be nice to see a string representation of the state like we do for other things like log item type, etc. Brian > + __entry->refcount, > + __entry->offset, > + __entry->lsn, > + (char *)__entry->caller_ip) > + > +); > + > +#define DEFINE_ICLOG_EVENT(name) \ > +DEFINE_EVENT(xlog_iclog_class, name, \ > + TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip), \ > + TP_ARGS(iclog, caller_ip)) > + > +DEFINE_ICLOG_EVENT(xlog_iclog_activate); > +DEFINE_ICLOG_EVENT(xlog_iclog_clean); > +DEFINE_ICLOG_EVENT(xlog_iclog_callback); > +DEFINE_ICLOG_EVENT(xlog_iclog_callbacks_start); > +DEFINE_ICLOG_EVENT(xlog_iclog_callbacks_done); > +DEFINE_ICLOG_EVENT(xlog_iclog_force); > +DEFINE_ICLOG_EVENT(xlog_iclog_force_lsn); > +DEFINE_ICLOG_EVENT(xlog_iclog_get_space); > +DEFINE_ICLOG_EVENT(xlog_iclog_release); > +DEFINE_ICLOG_EVENT(xlog_iclog_switch); > +DEFINE_ICLOG_EVENT(xlog_iclog_sync); > +DEFINE_ICLOG_EVENT(xlog_iclog_syncing); > +DEFINE_ICLOG_EVENT(xlog_iclog_sync_done); > +DEFINE_ICLOG_EVENT(xlog_iclog_want_sync); > +DEFINE_ICLOG_EVENT(xlog_iclog_wait_on); > +DEFINE_ICLOG_EVENT(xlog_iclog_write); > + > #endif /* _TRACE_XFS_H */ > > #undef TRACE_INCLUDE_PATH > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] xfs: add iclog state trace events 2021-06-15 15:37 ` Brian Foster @ 2021-06-15 22:47 ` Dave Chinner 2021-06-16 14:55 ` Brian Foster 0 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2021-06-15 22:47 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Tue, Jun 15, 2021 at 11:37:49AM -0400, Brian Foster wrote: > On Tue, Jun 15, 2021 at 04:46:57PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > For the DEBUGS! > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_log.c | 18 ++++++++++++++++ > > fs/xfs/xfs_trace.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 70 insertions(+) > > > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > > index e921b554b683..54fd6a695bb5 100644 > > --- a/fs/xfs/xfs_log.c > > +++ b/fs/xfs/xfs_log.c > > @@ -524,6 +524,7 @@ __xlog_state_release_iclog( > > iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn); > > xlog_verify_tail_lsn(log, iclog, tail_lsn); > > /* cycle incremented when incrementing curr_block */ > > + trace_xlog_iclog_syncing(iclog, _RET_IP_); > > return true; > > } > > > > @@ -543,6 +544,7 @@ xlog_state_release_iclog( > > { > > lockdep_assert_held(&log->l_icloglock); > > > > + trace_xlog_iclog_release(iclog, _RET_IP_); > > if (iclog->ic_state == XLOG_STATE_IOERROR) > > return -EIO; > > > > @@ -804,6 +806,7 @@ xlog_wait_on_iclog( > > { > > struct xlog *log = iclog->ic_log; > > > > + trace_xlog_iclog_wait_on(iclog, _RET_IP_); > > Seems like this might be more informative if we actually wait. This tells us the state at the time the call is made, so we know from that if we waited or not. Being able to see all calls to this function was instrumental in understanding the stiuation where we were waiting and when we weren't waiting. This is the iclog we got stuck waiting on: xlog_iclog_wait_on: dev 259:0 state 0x3 refcnt 0 offset 32256 lsn 0x2300004be0 caller xlog_cil_push_work The state is XLOG_STATE_DONE_SYNC, indicating IO had been completed, and the lsn tells me it is a future iclog we are waiting on, which tells us that it's waiting for a previous iclog to complete IO before it can move forward... > > if (!XLOG_FORCED_SHUTDOWN(log) && > > iclog->ic_state != XLOG_STATE_ACTIVE && > > iclog->ic_state != XLOG_STATE_DIRTY) { > > @@ -1804,6 +1807,7 @@ xlog_write_iclog( > > unsigned int count) > > { > > ASSERT(bno < log->l_logBBsize); > > + trace_xlog_iclog_write(iclog, _RET_IP_); > > > > /* > > * We lock the iclogbufs here so that we can serialise against I/O > > @@ -1950,6 +1954,7 @@ xlog_sync( > > unsigned int size; > > > > ASSERT(atomic_read(&iclog->ic_refcnt) == 0); > > + trace_xlog_iclog_sync(iclog, _RET_IP_); > > > > count = xlog_calc_iclog_size(log, iclog, &roundoff); > > > > @@ -2488,6 +2493,7 @@ xlog_state_activate_iclog( > > int *iclogs_changed) > > { > > ASSERT(list_empty_careful(&iclog->ic_callbacks)); > > + trace_xlog_iclog_activate(iclog, _RET_IP_); > > > > I wonder a bit whether these state change oriented tracepoints could be > served by a single trace_ixlog_iclog_state() or some such, since the > tracepoint already prints the state. That wouldn't show the before -> > state in each tracepoint, but that should reflect in the full log. I disagree. We lose a -lot- of context by getting rid of the per-call site trace names, especially those that get called from multiple places. Not to mention tracing an ACTIVE, idle iclog can result in zero state information except for the tracepoint name... e.g: 384.204393: xlog_iclog_write: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004b60 caller xlog_sync 384.204406: xlog_iclog_sync_done: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004b60 caller xlog_ioend_work 384.204407: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 0 lsn 0x0 caller xlog_write_get_more_iclog_space 384.204408: xlog_iclog_switch: dev 259:0 state 0x0 refcnt 1 offset 0 lsn 0x2300004ba0 caller xlog_state_get_iclog_space 384.204418: xlog_iclog_release: dev 259:0 state 0x1 refcnt 1 offset 32256 lsn 0x2300004ba0 caller xlog_write_get_more_iclog_space 384.204419: xlog_iclog_syncing: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_write_get_more_iclog_space 384.204419: xlog_iclog_sync: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_state_release_iclog 384.204422: xlog_iclog_write: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_sync 384.204435: xlog_iclog_sync_done: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_ioend_work See how the xlog_iclog_get_space trace has zero state/lsn? That indicates that it has actually found an idle, active iclog in the correct state for it to hand out. We can also see this progression is through xlog_write() filling mulitple iclogs. However, we can get different progressions from different contexts such as: 384.193766: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 0 lsn 0x0 caller xlog_write_get_more_iclog_space 384.193778: xlog_iclog_release: dev 259:0 state 0x0 refcnt 1 offset 27352 lsn 0x23000042b0 caller xlog_write 384.193779: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 27352 lsn 0x23000042b0 caller xlog_write 384.193780: xlog_iclog_wait_on: dev 259:0 state 0x0 refcnt 0 offset 0 lsn 0x0 caller xlog_cil_push_work 384.193781: xlog_iclog_release: dev 259:0 state 0x0 refcnt 1 offset 27364 lsn 0x23000042b0 caller xlog_cil_push_work 384.201268: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 27364 lsn 0x23000042b0 caller xlog_write .... the competion of a checkpoint where xlog_write() releases the iclog and it's not a full iclog. This is followed immediately by the commit record write which gets iclog space from a different caller location, passes it back to xlog_cil_push_work which then releases it. ANd then we try to wait on the previous iclog which is an idle ACTIVE buffer, so we skip the wait. And because this isn't a "stable CIL push", we don't see a xlog_iclog_switch() trace, and the so iclog is not flushed to disk when it is release here. IOWs, if we don't have the individual trace names, we largely lose all this context and then have to infer it from caller and state changes that occur. That just makes things unnnecessarily difficult when looking for problems in hundreds of thousands trace events.... > > @@ -3138,6 +3153,8 @@ xfs_log_force( > > if (iclog->ic_state == XLOG_STATE_IOERROR) > > goto out_error; > > > > + trace_xlog_iclog_force(iclog, _RET_IP_); > > + > > We have a log force tracepoint a few lines up. Do we need both? Yes. The log force tracepoint tells us nothing about the iclog state we are about to operate on. > > + TP_ARGS(iclog, caller_ip), > > + TP_STRUCT__entry( > > + __field(dev_t, dev) > > + __field(uint32_t, state) > > + __field(int32_t, refcount) > > + __field(uint32_t, offset) > > + __field(unsigned long long, lsn) > > + __field(unsigned long, caller_ip) > > + ), > > + TP_fast_assign( > > + __entry->dev = iclog->ic_log->l_mp->m_super->s_dev; > > + __entry->state = iclog->ic_state; > > + __entry->refcount = atomic_read(&iclog->ic_refcnt); > > + __entry->offset = iclog->ic_offset; > > + __entry->lsn = be64_to_cpu(iclog->ic_header.h_lsn); > > + __entry->caller_ip = caller_ip; > > + ), > > + TP_printk("dev %d:%d state 0x%x refcnt %d offset %u lsn 0x%llx caller %pS", > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > + __entry->state, > > It might be nice to see a string representation of the state like we do > for other things like log item type, etc. Yup, I thought about that after I sent when looking at other code. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] xfs: add iclog state trace events 2021-06-15 22:47 ` Dave Chinner @ 2021-06-16 14:55 ` Brian Foster 0 siblings, 0 replies; 14+ messages in thread From: Brian Foster @ 2021-06-16 14:55 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Wed, Jun 16, 2021 at 08:47:09AM +1000, Dave Chinner wrote: > On Tue, Jun 15, 2021 at 11:37:49AM -0400, Brian Foster wrote: > > On Tue, Jun 15, 2021 at 04:46:57PM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > For the DEBUGS! > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > fs/xfs/xfs_log.c | 18 ++++++++++++++++ > > > fs/xfs/xfs_trace.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++ > > > 2 files changed, 70 insertions(+) > > > > > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > > > index e921b554b683..54fd6a695bb5 100644 > > > --- a/fs/xfs/xfs_log.c > > > +++ b/fs/xfs/xfs_log.c > > > @@ -524,6 +524,7 @@ __xlog_state_release_iclog( > > > iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn); > > > xlog_verify_tail_lsn(log, iclog, tail_lsn); > > > /* cycle incremented when incrementing curr_block */ > > > + trace_xlog_iclog_syncing(iclog, _RET_IP_); > > > return true; > > > } > > > > > > @@ -543,6 +544,7 @@ xlog_state_release_iclog( > > > { > > > lockdep_assert_held(&log->l_icloglock); > > > > > > + trace_xlog_iclog_release(iclog, _RET_IP_); > > > if (iclog->ic_state == XLOG_STATE_IOERROR) > > > return -EIO; > > > > > > @@ -804,6 +806,7 @@ xlog_wait_on_iclog( > > > { > > > struct xlog *log = iclog->ic_log; > > > > > > + trace_xlog_iclog_wait_on(iclog, _RET_IP_); > > > > Seems like this might be more informative if we actually wait. > > This tells us the state at the time the call is made, so we know > from that if we waited or not. Being able to see all calls to this > function was instrumental in understanding the stiuation where we > were waiting and when we weren't waiting. This is the iclog we got > stuck waiting on: > > xlog_iclog_wait_on: dev 259:0 state 0x3 refcnt 0 offset 32256 lsn 0x2300004be0 caller xlog_cil_push_work > > The state is XLOG_STATE_DONE_SYNC, indicating IO had been completed, > and the lsn tells me it is a future iclog we are waiting on, which > tells us that it's waiting for a previous iclog to complete IO > before it can move forward... > Then this tracepoint would have still triggered, wouldn't it? What's the great significance in seeing a tracepoint for a function call in cases where that call is otherwise essentially a no-op? My point is that from a general standpoint (i.e. once this particular issue is addressed), I find it obscured to see a bunch of blah_wait* tracepoints in a full trace dump because that explicitly suggests that something actually waited, when in reality, it's just a byproduct of an unfortunately named (i.e. based on the function name) function call based tracepoint. Thinking more about it, this is the exact type of thing I almost always end up modifying or replacing with my own trace_printk() when debugging anything that already requires reproduction to root cause, because otherwise I have to sift through more context just to filter out function call noise from the actual meaningful calls. There's also subtle details like how this doesn't reflect shutdown state (so technically this doesn't actually provide all the information required to determine whether a block actually occurred), or that your _syncing() tracepoint is in a very similar situation, yet in that case you opted to bind the tracepoint with the relevant logic vs. to the function call/name. All I'm asking for is to do a similar thing here by moving this one down a few lines. What's the problem? > > > if (!XLOG_FORCED_SHUTDOWN(log) && > > > iclog->ic_state != XLOG_STATE_ACTIVE && > > > iclog->ic_state != XLOG_STATE_DIRTY) { > > > @@ -1804,6 +1807,7 @@ xlog_write_iclog( > > > unsigned int count) > > > { > > > ASSERT(bno < log->l_logBBsize); > > > + trace_xlog_iclog_write(iclog, _RET_IP_); > > > > > > /* > > > * We lock the iclogbufs here so that we can serialise against I/O > > > @@ -1950,6 +1954,7 @@ xlog_sync( > > > unsigned int size; > > > > > > ASSERT(atomic_read(&iclog->ic_refcnt) == 0); > > > + trace_xlog_iclog_sync(iclog, _RET_IP_); > > > > > > count = xlog_calc_iclog_size(log, iclog, &roundoff); > > > > > > @@ -2488,6 +2493,7 @@ xlog_state_activate_iclog( > > > int *iclogs_changed) > > > { > > > ASSERT(list_empty_careful(&iclog->ic_callbacks)); > > > + trace_xlog_iclog_activate(iclog, _RET_IP_); > > > > > > > I wonder a bit whether these state change oriented tracepoints could be > > served by a single trace_ixlog_iclog_state() or some such, since the > > tracepoint already prints the state. That wouldn't show the before -> > > state in each tracepoint, but that should reflect in the full log. > > I disagree. We lose a -lot- of context by getting rid of the > per-call site trace names, especially those that get called from > multiple places. Not to mention tracing an ACTIVE, idle iclog can > result in zero state information except for the tracepoint name... > e.g: > > 384.204393: xlog_iclog_write: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004b60 caller xlog_sync > 384.204406: xlog_iclog_sync_done: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004b60 caller xlog_ioend_work > 384.204407: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 0 lsn 0x0 caller xlog_write_get_more_iclog_space > 384.204408: xlog_iclog_switch: dev 259:0 state 0x0 refcnt 1 offset 0 lsn 0x2300004ba0 caller xlog_state_get_iclog_space > 384.204418: xlog_iclog_release: dev 259:0 state 0x1 refcnt 1 offset 32256 lsn 0x2300004ba0 caller xlog_write_get_more_iclog_space > 384.204419: xlog_iclog_syncing: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_write_get_more_iclog_space > 384.204419: xlog_iclog_sync: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_state_release_iclog > 384.204422: xlog_iclog_write: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_sync > 384.204435: xlog_iclog_sync_done: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_ioend_work > > See how the xlog_iclog_get_space trace has zero state/lsn? That > indicates that it has actually found an idle, active iclog in the > correct state for it to hand out. We can also see this progression > is through xlog_write() filling mulitple iclogs. However, we can get > different progressions from different contexts such as: > > 384.193766: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 0 lsn 0x0 caller xlog_write_get_more_iclog_space > 384.193778: xlog_iclog_release: dev 259:0 state 0x0 refcnt 1 offset 27352 lsn 0x23000042b0 caller xlog_write > 384.193779: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 27352 lsn 0x23000042b0 caller xlog_write > 384.193780: xlog_iclog_wait_on: dev 259:0 state 0x0 refcnt 0 offset 0 lsn 0x0 caller xlog_cil_push_work > 384.193781: xlog_iclog_release: dev 259:0 state 0x0 refcnt 1 offset 27364 lsn 0x23000042b0 caller xlog_cil_push_work > 384.201268: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 27364 lsn 0x23000042b0 caller xlog_write > > .... the competion of a checkpoint where xlog_write() releases the > iclog and it's not a full iclog. This is followed immediately by the > commit record write which gets iclog space from a different caller > location, passes it back to xlog_cil_push_work which then releases > it. ANd then we try to wait on the previous iclog which is an idle > ACTIVE buffer, so we skip the wait. And because this isn't a > "stable CIL push", we don't see a xlog_iclog_switch() trace, and the > so iclog is not flushed to disk when it is release here. > Side note: I like how you incorporated the lack of an _iclog_switch() tracepoint in your analysis. It's almost like we'd be capable of similar reasoning wrt that _wait_on() tracepoint. ;) That aside, fair enough on the names. The reason I suggest it is I think it would be nicer to be able to do something like 'trace-cmd start -e xfs:xlog_iclog_state*' to see all the state transitions and nothing else, but that's not a huge deal and secondary to providing useful information. I do wonder a bit if we could accomplish something similar with a new state setting inline and a custom, single tracepoint to include the calling function name, but that's not something worth digging into now. > IOWs, if we don't have the individual trace names, we largely lose > all this context and then have to infer it from caller and state > changes that occur. That just makes things unnnecessarily difficult > when looking for problems in hundreds of thousands trace events.... > > > > @@ -3138,6 +3153,8 @@ xfs_log_force( > > > if (iclog->ic_state == XLOG_STATE_IOERROR) > > > goto out_error; > > > > > > + trace_xlog_iclog_force(iclog, _RET_IP_); > > > + > > > > We have a log force tracepoint a few lines up. Do we need both? > > Yes. The log force tracepoint tells us nothing about the iclog > state we are about to operate on. > Ok. Brian > > > + TP_ARGS(iclog, caller_ip), > > > + TP_STRUCT__entry( > > > + __field(dev_t, dev) > > > + __field(uint32_t, state) > > > + __field(int32_t, refcount) > > > + __field(uint32_t, offset) > > > + __field(unsigned long long, lsn) > > > + __field(unsigned long, caller_ip) > > > + ), > > > + TP_fast_assign( > > > + __entry->dev = iclog->ic_log->l_mp->m_super->s_dev; > > > + __entry->state = iclog->ic_state; > > > + __entry->refcount = atomic_read(&iclog->ic_refcnt); > > > + __entry->offset = iclog->ic_offset; > > > + __entry->lsn = be64_to_cpu(iclog->ic_header.h_lsn); > > > + __entry->caller_ip = caller_ip; > > > + ), > > > + TP_printk("dev %d:%d state 0x%x refcnt %d offset %u lsn 0x%llx caller %pS", > > > + MAJOR(__entry->dev), MINOR(__entry->dev), > > > + __entry->state, > > > > It might be nice to see a string representation of the state like we do > > for other things like log item type, etc. > > Yup, I thought about that after I sent when looking at other code. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL 2021-06-15 6:46 xfs: fix CIL push hang in for-next tree Dave Chinner 2021-06-15 6:46 ` [PATCH 1/2] xfs: add iclog state trace events Dave Chinner @ 2021-06-15 6:46 ` Dave Chinner 2021-06-15 15:38 ` Brian Foster 2021-06-15 16:19 ` Darrick J. Wong 2021-06-15 17:57 ` xfs: fix CIL push hang in for-next tree Darrick J. Wong 2 siblings, 2 replies; 14+ messages in thread From: Dave Chinner @ 2021-06-15 6:46 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> The iclogbuf ring attached to the struct xlog is circular, hence the first and last iclogs in the ring can only be determined by comparing them against the log->l_iclog pointer. In xfs_cil_push_work(), we want to wait on previous iclogs that were issued so that we can flush them to stable storage with the commit record write, and it simply waits on the previous iclog in the ring. This, however, leads to CIL push hangs in generic/019 like so: task:kworker/u33:0 state:D stack:12680 pid: 7 ppid: 2 flags:0x00004000 Workqueue: xfs-cil/pmem1 xlog_cil_push_work Call Trace: __schedule+0x30b/0x9f0 schedule+0x68/0xe0 xlog_wait_on_iclog+0x121/0x190 ? wake_up_q+0xa0/0xa0 xlog_cil_push_work+0x994/0xa10 ? _raw_spin_lock+0x15/0x20 ? xfs_swap_extents+0x920/0x920 process_one_work+0x1ab/0x390 worker_thread+0x56/0x3d0 ? rescuer_thread+0x3c0/0x3c0 kthread+0x14d/0x170 ? __kthread_bind_mask+0x70/0x70 ret_from_fork+0x1f/0x30 With other threads blocking in either xlog_state_get_iclog_space() waiting for iclog space or xlog_grant_head_wait() waiting for log reservation space. The problem here is that the previous iclog on the ring might actually be a future iclog. That is, if log->l_iclog points at commit_iclog, commit_iclog is the first (oldest) iclog in the ring and there are no previous iclogs pending as they have all completed their IO and been activated again. IOWs, commit_iclog->ic_prev points to an iclog that will be written in the future, not one that has been written in the past. Hence, in this case, waiting on the ->ic_prev iclog is incorrect behaviour, and depending on the state of the future iclog, we can end up with a circular ABA wait cycle and we hang. Fix this by only waiting on the previous iclog when the commit_iclog is not the oldest iclog in the ring. Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO") Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_log_cil.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 705619e9dab4..398f00cf9cbf 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -1075,15 +1075,21 @@ xlog_cil_push_work( ticket = ctx->ticket; /* - * If the checkpoint spans multiple iclogs, wait for all previous - * iclogs to complete before we submit the commit_iclog. In this case, - * the commit_iclog write needs to issue a pre-flush so that the - * ordering is correctly preserved down to stable storage. + * If the checkpoint spans multiple iclogs, wait for all previous iclogs + * to complete before we submit the commit_iclog. If the commit iclog is + * at the head of the iclog ring, then all other iclogs have completed + * and are waiting on this one and hence we don't need to wait. + * + * Regardless of whether we need to wait or not, the the commit_iclog + * write needs to issue a pre-flush so that the ordering for this + * checkpoint is correctly preserved down to stable storage. */ spin_lock(&log->l_icloglock); if (ctx->start_lsn != commit_lsn) { - xlog_wait_on_iclog(commit_iclog->ic_prev); - spin_lock(&log->l_icloglock); + if (commit_iclog != log->l_iclog) { + xlog_wait_on_iclog(commit_iclog->ic_prev); + spin_lock(&log->l_icloglock); + } commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH; } -- 2.31.1 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL 2021-06-15 6:46 ` [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner @ 2021-06-15 15:38 ` Brian Foster 2021-06-15 22:09 ` Dave Chinner 2021-06-15 16:19 ` Darrick J. Wong 1 sibling, 1 reply; 14+ messages in thread From: Brian Foster @ 2021-06-15 15:38 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > The iclogbuf ring attached to the struct xlog is circular, hence the > first and last iclogs in the ring can only be determined by > comparing them against the log->l_iclog pointer. > > In xfs_cil_push_work(), we want to wait on previous iclogs that were > issued so that we can flush them to stable storage with the commit > record write, and it simply waits on the previous iclog in the ring. > This, however, leads to CIL push hangs in generic/019 like so: > > task:kworker/u33:0 state:D stack:12680 pid: 7 ppid: 2 flags:0x00004000 > Workqueue: xfs-cil/pmem1 xlog_cil_push_work > Call Trace: > __schedule+0x30b/0x9f0 > schedule+0x68/0xe0 > xlog_wait_on_iclog+0x121/0x190 > ? wake_up_q+0xa0/0xa0 > xlog_cil_push_work+0x994/0xa10 > ? _raw_spin_lock+0x15/0x20 > ? xfs_swap_extents+0x920/0x920 > process_one_work+0x1ab/0x390 > worker_thread+0x56/0x3d0 > ? rescuer_thread+0x3c0/0x3c0 > kthread+0x14d/0x170 > ? __kthread_bind_mask+0x70/0x70 > ret_from_fork+0x1f/0x30 > > With other threads blocking in either xlog_state_get_iclog_space() > waiting for iclog space or xlog_grant_head_wait() waiting for log > reservation space. > > The problem here is that the previous iclog on the ring might > actually be a future iclog. That is, if log->l_iclog points at > commit_iclog, commit_iclog is the first (oldest) iclog in the ring > and there are no previous iclogs pending as they have all completed > their IO and been activated again. IOWs, commit_iclog->ic_prev > points to an iclog that will be written in the future, not one that > has been written in the past. > > Hence, in this case, waiting on the ->ic_prev iclog is incorrect > behaviour, and depending on the state of the future iclog, we can > end up with a circular ABA wait cycle and we hang. > > Fix this by only waiting on the previous iclog when the commit_iclog > is not the oldest iclog in the ring. > > Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO") > Reported-by: Brian Foster <bfoster@redhat.com> > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log_cil.c | 18 ++++++++++++------ > 1 file changed, 12 insertions(+), 6 deletions(-) > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index 705619e9dab4..398f00cf9cbf 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -1075,15 +1075,21 @@ xlog_cil_push_work( > ticket = ctx->ticket; > > /* > - * If the checkpoint spans multiple iclogs, wait for all previous > - * iclogs to complete before we submit the commit_iclog. In this case, > - * the commit_iclog write needs to issue a pre-flush so that the > - * ordering is correctly preserved down to stable storage. > + * If the checkpoint spans multiple iclogs, wait for all previous iclogs > + * to complete before we submit the commit_iclog. If the commit iclog is > + * at the head of the iclog ring, then all other iclogs have completed > + * and are waiting on this one and hence we don't need to wait. > + * > + * Regardless of whether we need to wait or not, the the commit_iclog > + * write needs to issue a pre-flush so that the ordering for this > + * checkpoint is correctly preserved down to stable storage. > */ > spin_lock(&log->l_icloglock); > if (ctx->start_lsn != commit_lsn) { > - xlog_wait_on_iclog(commit_iclog->ic_prev); > - spin_lock(&log->l_icloglock); > + if (commit_iclog != log->l_iclog) { > + xlog_wait_on_iclog(commit_iclog->ic_prev); > + spin_lock(&log->l_icloglock); > + } I'm confused at what this is attempting to accomplish. If we have a single CIL force and that happens to span several iclogs, isn't the goal to wait on the previous iclog to ensure the previously written content of the current checkpoint is on-disk? If so, it looks to me that commit_iclog will always match log->l_iclog in that scenario because in the commit record path we don't (potentially) switch it out until toward the end of this function. Hmm.. did you mean to check ->ic_prev against ->l_iclog perhaps, or am I just missing something else? Brian > commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH; > } > > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL 2021-06-15 15:38 ` Brian Foster @ 2021-06-15 22:09 ` Dave Chinner 2021-06-16 14:57 ` Brian Foster 0 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2021-06-15 22:09 UTC (permalink / raw) To: Brian Foster, Darrick J. Wong; +Cc: linux-xfs On Tue, Jun 15, 2021 at 11:38:14AM -0400, Brian Foster wrote: > On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > The iclogbuf ring attached to the struct xlog is circular, hence the > > first and last iclogs in the ring can only be determined by > > comparing them against the log->l_iclog pointer. > > > > In xfs_cil_push_work(), we want to wait on previous iclogs that were > > issued so that we can flush them to stable storage with the commit > > record write, and it simply waits on the previous iclog in the ring. > > This, however, leads to CIL push hangs in generic/019 like so: > > > > task:kworker/u33:0 state:D stack:12680 pid: 7 ppid: 2 flags:0x00004000 > > Workqueue: xfs-cil/pmem1 xlog_cil_push_work > > Call Trace: > > __schedule+0x30b/0x9f0 > > schedule+0x68/0xe0 > > xlog_wait_on_iclog+0x121/0x190 > > ? wake_up_q+0xa0/0xa0 > > xlog_cil_push_work+0x994/0xa10 > > ? _raw_spin_lock+0x15/0x20 > > ? xfs_swap_extents+0x920/0x920 > > process_one_work+0x1ab/0x390 > > worker_thread+0x56/0x3d0 > > ? rescuer_thread+0x3c0/0x3c0 > > kthread+0x14d/0x170 > > ? __kthread_bind_mask+0x70/0x70 > > ret_from_fork+0x1f/0x30 > > > > With other threads blocking in either xlog_state_get_iclog_space() > > waiting for iclog space or xlog_grant_head_wait() waiting for log > > reservation space. > > > > The problem here is that the previous iclog on the ring might > > actually be a future iclog. That is, if log->l_iclog points at > > commit_iclog, commit_iclog is the first (oldest) iclog in the ring > > and there are no previous iclogs pending as they have all completed > > their IO and been activated again. IOWs, commit_iclog->ic_prev > > points to an iclog that will be written in the future, not one that > > has been written in the past. > > > > Hence, in this case, waiting on the ->ic_prev iclog is incorrect > > behaviour, and depending on the state of the future iclog, we can > > end up with a circular ABA wait cycle and we hang. > > > > Fix this by only waiting on the previous iclog when the commit_iclog > > is not the oldest iclog in the ring. > > > > Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO") > > Reported-by: Brian Foster <bfoster@redhat.com> > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_log_cil.c | 18 ++++++++++++------ > > 1 file changed, 12 insertions(+), 6 deletions(-) > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > index 705619e9dab4..398f00cf9cbf 100644 > > --- a/fs/xfs/xfs_log_cil.c > > +++ b/fs/xfs/xfs_log_cil.c > > @@ -1075,15 +1075,21 @@ xlog_cil_push_work( > > ticket = ctx->ticket; > > > > /* > > - * If the checkpoint spans multiple iclogs, wait for all previous > > - * iclogs to complete before we submit the commit_iclog. In this case, > > - * the commit_iclog write needs to issue a pre-flush so that the > > - * ordering is correctly preserved down to stable storage. > > + * If the checkpoint spans multiple iclogs, wait for all previous iclogs > > + * to complete before we submit the commit_iclog. If the commit iclog is > > + * at the head of the iclog ring, then all other iclogs have completed > > + * and are waiting on this one and hence we don't need to wait. > > + * > > + * Regardless of whether we need to wait or not, the the commit_iclog > > + * write needs to issue a pre-flush so that the ordering for this > > + * checkpoint is correctly preserved down to stable storage. > > */ > > spin_lock(&log->l_icloglock); > > if (ctx->start_lsn != commit_lsn) { > > - xlog_wait_on_iclog(commit_iclog->ic_prev); > > - spin_lock(&log->l_icloglock); > > + if (commit_iclog != log->l_iclog) { > > + xlog_wait_on_iclog(commit_iclog->ic_prev); > > + spin_lock(&log->l_icloglock); > > + } Brian says: > I'm confused at what this is attempting to accomplish. If we have a > single CIL force and that happens to span several iclogs, isn't the goal > to wait on the previous iclog to ensure the previously written content > of the current checkpoint is on-disk? Yes. I wouldn't be surprised if I broke something, though, because I was only looking at situation where the log hang was occurring. I think I should have added a state check on the commit_iclog being in ACTIVE state, too, because if it is active then there are no future iclogs in flight and we should definitely wait. But that doesn't solve the problem, either, because if it is in WANT_SYNC there are future iclogs in flight (because xlog_state_get_iclog_space() has switched iclogs) and then we're in trouble because.... Darrick says: > I'm confused. How can you tell that we need to wait for > commit_iclog->ic_prev to be written out by comparing commit_iclog to > log->l_iclog? Can't you determine this by checking ic_prev for DIRTY > state? .... the problem is that we can't tell if ->ic_prev is a future or past iclog based on it's state. If it is in the past the ic_prev iclog could be in any state from WANT_SYNC to ACTIVE, depending on where the IO completion is. And if it's in the future, it could be in any state from ACTIVE to SYNC_DONE. It's the overlap in past/future states that causes the problem here. Being in DIRTY state just means that IO completion is running, callbacks are complete and we're about to run a ic_force_wait queue wakeup and activate the iclog again. When the hang occurs, the future iclogs are stuck in the DONE_SYNC state and can't progress through to the callback state, let alone dirty... Even an iclog in ACTIVE state doesn't mean it is in the past - it just means that it's ready to be written to. It could be being filled at this point in time by another checkpoint via xlog_write(), and it, at any point in time, can transition to WANT_SYNC, SYNCING, or SYNC_DONE. xlog_wait_on_iclog() will ignore ACTIVE/DIRTY iclogs, anyway, but the problem is that a future iclog can still transition at any time to WANT_SYNC or any other state indicating IO is in progress. Hence we just can't tell if it is a past or future iclog and so we can't blindly wait on it. In the common case, ic_prev is a past iclog and it won't get stuck on the commit_iclog being ACTIVE/WANT_SYNC at the head of the log->l_iclog ring in xlog_state_do_callback() and so will be processed and go back to ACTIVE state. However, in the rare case it is a future iclog, then it gets stuck waiting for IO completion of the commit_iclog to reach xlog_state_do_callback().... ---- What I was trying to take into account via the (commit_iclog != log->l_iclog) check was a measure of past/future tense. The hang always occurs where commit_iclog == log->l_iclog and all ic_next iclogs are future iclogs rather than being past iclogs. As Brian notes, this probably just broke the single commit case, but because the failure stress tests load up the log greatly, they don't exercise that case. So I went and looked at all the cases where we check ic_prev->ic_state to make decisions. There's only one place that does that: log forces. The sync transaction/log force aggregation code is the only thing that checks the state of the previous iclog, and it appears to avoid the "future iclog" problem in two ways. First, it only checks the previous iclog state when the current head iclog state is ACTIVE (i.e. no future iclogs). Secondly, even if it ic-prev is a future iclog, it only waits on the ic_write_wait queue, which means the log force gets woken when IO completion occurs and the iclog transitions to DONE_SYNC, hence it never waits for ordering of callbacks (other log force code does that). IOWs, there's only one place that actually tries to use the previous iclog state for making decisions, and it does doesn't really handle the case we need to handle here which is determining if the previous iclog is a past or future iclog. I does, however, handle the case of incorrect waiting safely. That is, what we need to wait for in the CIL push is the completion of the physical IO to be signalled, not for the callbacks associated with that iclog to run. IOWs, neat as it may be, xlog_wait_on_iclog() is not the right function to be using here. We should be waiting on the ic_write_wait queue to be woken when iclog IO completion occurs, not the ic_force_wait queue that waits on ordered completion of entire journal checkpoints. We just want to ensure that all previous iclogs are physically completed before we do the pre-flush on the commit record IO, we don't need software processing of those IOs to be complete. So, I think the right thing to do here is change this code to wait on the ic_write_wait queue for completion of the previous iclog and then we just don't care if commit_iclog->ic_prev is a past or future iclog.... -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL 2021-06-15 22:09 ` Dave Chinner @ 2021-06-16 14:57 ` Brian Foster 2021-06-16 22:20 ` Dave Chinner 0 siblings, 1 reply; 14+ messages in thread From: Brian Foster @ 2021-06-16 14:57 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs On Wed, Jun 16, 2021 at 08:09:44AM +1000, Dave Chinner wrote: > On Tue, Jun 15, 2021 at 11:38:14AM -0400, Brian Foster wrote: > > On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > The iclogbuf ring attached to the struct xlog is circular, hence the > > > first and last iclogs in the ring can only be determined by > > > comparing them against the log->l_iclog pointer. > > > > > > In xfs_cil_push_work(), we want to wait on previous iclogs that were > > > issued so that we can flush them to stable storage with the commit > > > record write, and it simply waits on the previous iclog in the ring. > > > This, however, leads to CIL push hangs in generic/019 like so: > > > > > > task:kworker/u33:0 state:D stack:12680 pid: 7 ppid: 2 flags:0x00004000 > > > Workqueue: xfs-cil/pmem1 xlog_cil_push_work > > > Call Trace: > > > __schedule+0x30b/0x9f0 > > > schedule+0x68/0xe0 > > > xlog_wait_on_iclog+0x121/0x190 > > > ? wake_up_q+0xa0/0xa0 > > > xlog_cil_push_work+0x994/0xa10 > > > ? _raw_spin_lock+0x15/0x20 > > > ? xfs_swap_extents+0x920/0x920 > > > process_one_work+0x1ab/0x390 > > > worker_thread+0x56/0x3d0 > > > ? rescuer_thread+0x3c0/0x3c0 > > > kthread+0x14d/0x170 > > > ? __kthread_bind_mask+0x70/0x70 > > > ret_from_fork+0x1f/0x30 > > > > > > With other threads blocking in either xlog_state_get_iclog_space() > > > waiting for iclog space or xlog_grant_head_wait() waiting for log > > > reservation space. > > > > > > The problem here is that the previous iclog on the ring might > > > actually be a future iclog. That is, if log->l_iclog points at > > > commit_iclog, commit_iclog is the first (oldest) iclog in the ring > > > and there are no previous iclogs pending as they have all completed > > > their IO and been activated again. IOWs, commit_iclog->ic_prev > > > points to an iclog that will be written in the future, not one that > > > has been written in the past. > > > > > > Hence, in this case, waiting on the ->ic_prev iclog is incorrect > > > behaviour, and depending on the state of the future iclog, we can > > > end up with a circular ABA wait cycle and we hang. > > > > > > Fix this by only waiting on the previous iclog when the commit_iclog > > > is not the oldest iclog in the ring. > > > > > > Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO") > > > Reported-by: Brian Foster <bfoster@redhat.com> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > fs/xfs/xfs_log_cil.c | 18 ++++++++++++------ > > > 1 file changed, 12 insertions(+), 6 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > > index 705619e9dab4..398f00cf9cbf 100644 > > > --- a/fs/xfs/xfs_log_cil.c > > > +++ b/fs/xfs/xfs_log_cil.c > > > @@ -1075,15 +1075,21 @@ xlog_cil_push_work( > > > ticket = ctx->ticket; > > > > > > /* > > > - * If the checkpoint spans multiple iclogs, wait for all previous > > > - * iclogs to complete before we submit the commit_iclog. In this case, > > > - * the commit_iclog write needs to issue a pre-flush so that the > > > - * ordering is correctly preserved down to stable storage. > > > + * If the checkpoint spans multiple iclogs, wait for all previous iclogs > > > + * to complete before we submit the commit_iclog. If the commit iclog is > > > + * at the head of the iclog ring, then all other iclogs have completed > > > + * and are waiting on this one and hence we don't need to wait. > > > + * > > > + * Regardless of whether we need to wait or not, the the commit_iclog > > > + * write needs to issue a pre-flush so that the ordering for this > > > + * checkpoint is correctly preserved down to stable storage. > > > */ > > > spin_lock(&log->l_icloglock); > > > if (ctx->start_lsn != commit_lsn) { > > > - xlog_wait_on_iclog(commit_iclog->ic_prev); > > > - spin_lock(&log->l_icloglock); > > > + if (commit_iclog != log->l_iclog) { > > > + xlog_wait_on_iclog(commit_iclog->ic_prev); > > > + spin_lock(&log->l_icloglock); > > > + } > > Brian says: > > > I'm confused at what this is attempting to accomplish. If we have a > > single CIL force and that happens to span several iclogs, isn't the goal > > to wait on the previous iclog to ensure the previously written content > > of the current checkpoint is on-disk? > > Yes. I wouldn't be surprised if I broke something, though, because I > was only looking at situation where the log hang was occurring. > > I think I should have added a state check on the commit_iclog being > in ACTIVE state, too, because if it is active then there are no > future iclogs in flight and we should definitely wait. But that > doesn't solve the problem, either, because if it is in WANT_SYNC > there are future iclogs in flight (because > xlog_state_get_iclog_space() has switched iclogs) and then we're in > trouble because.... > > > Darrick says: > > > I'm confused. How can you tell that we need to wait for > > commit_iclog->ic_prev to be written out by comparing commit_iclog to > > log->l_iclog? Can't you determine this by checking ic_prev for DIRTY > > state? > > .... the problem is that we can't tell if ->ic_prev is a future or > past iclog based on it's state. If it is in the past the ic_prev > iclog could be in any state from WANT_SYNC to ACTIVE, depending on > where the IO completion is. And if it's in the future, it could be > in any state from ACTIVE to SYNC_DONE. It's the overlap in > past/future states that causes the problem here. > > Being in DIRTY state just means that IO completion is running, > callbacks are complete and we're about to run a ic_force_wait queue > wakeup and activate the iclog again. When the hang occurs, the > future iclogs are stuck in the DONE_SYNC state and can't progress > through to the callback state, let alone dirty... > > Even an iclog in ACTIVE state doesn't mean it is in the past - it > just means that it's ready to be written to. It could be being > filled at this point in time by another checkpoint via xlog_write(), > and it, at any point in time, can transition to WANT_SYNC, SYNCING, > or SYNC_DONE. > > xlog_wait_on_iclog() will ignore ACTIVE/DIRTY iclogs, anyway, but > the problem is that a future iclog can still transition at any time > to WANT_SYNC or any other state indicating IO is in progress. Hence > we just can't tell if it is a past or future iclog and so we can't > blindly wait on it. In the common case, ic_prev is a past iclog and > it won't get stuck on the commit_iclog being ACTIVE/WANT_SYNC at the > head of the log->l_iclog ring in xlog_state_do_callback() and so > will be processed and go back to ACTIVE state. However, in the rare > case it is a future iclog, then it gets stuck waiting for IO > completion of the commit_iclog to reach xlog_state_do_callback().... > > ---- > > What I was trying to take into account via the (commit_iclog != > log->l_iclog) check was a measure of past/future tense. The hang > always occurs where commit_iclog == log->l_iclog and all ic_next > iclogs are future iclogs rather than being past iclogs. As Brian > notes, this probably just broke the single commit case, but because > the failure stress tests load up the log greatly, they don't > exercise that case. > That's kind of why I was wondering if you meant to check the state of ->ic_prev against ->l_iclog, but thinking about it again I'm not sure that is valid either because I don't think anything prevents an iclog from landing at the head before it might have completed pending I/O. Maybe that doesn't make sense, I'm still wrapping my head around the whole multi-push CIL model the original patch creates.. > So I went and looked at all the cases where we check > ic_prev->ic_state to make decisions. There's only one place that > does that: log forces. The sync transaction/log force aggregation > code is the only thing that checks the state of the previous iclog, > and it appears to avoid the "future iclog" problem in two ways. > First, it only checks the previous iclog state when the current head > iclog state is ACTIVE (i.e. no future iclogs). Secondly, even if it > ic-prev is a future iclog, it only waits on the ic_write_wait queue, > which means the log force gets woken when IO completion occurs and > the iclog transitions to DONE_SYNC, hence it never waits for > ordering of callbacks (other log force code does that). > > IOWs, there's only one place that actually tries to use the previous > iclog state for making decisions, and it does doesn't really handle > the case we need to handle here which is determining if the previous > iclog is a past or future iclog. I does, however, handle the case of > incorrect waiting safely. > > That is, what we need to wait for in the CIL push is the completion > of the physical IO to be signalled, not for the callbacks associated > with that iclog to run. IOWs, neat as it may be, > xlog_wait_on_iclog() is not the right function to be using here. We > should be waiting on the ic_write_wait queue to be woken when iclog > IO completion occurs, not the ic_force_wait queue that waits on > ordered completion of entire journal checkpoints. We just want to > ensure that all previous iclogs are physically completed before we > do the pre-flush on the commit record IO, we don't need software > processing of those IOs to be complete. > Ok, but doesn't that leave a gap in scenarios where iclog completion occurs out of order? I thought the reason for waiting on the force of the previous iclog was to encapsulate that the entire content of the checkpoint has been written to disk, not just the content in the immediately previous iclog. If the issue is past/future tense, is there any reason we can't filter against the record header lsn of the previous iclog? E.g., assuming the iclog hasn't been cleaned/reactivated (i.e. h_lsn != 0), then only block on the previous iclog if XFS_LSN_CMP(prev_lsn, commit_lsn) <= 0? Hmm.. that also has me wondering if it's possible that the previous iclog might be some other future checkpoint, but then skipping the wait entirely means the checkpoint might not actually be on-disk [1]. I wonder if we need to do something like track the last lsn and/or iclog touched by the current checkpoint and make sure we wait on that, regardless of where it is in the ring (but as above, I'm still wrapping my head around all of this). Brian [1] Once we work out the functional behavior one way or another, how do we actually test/confirm that the flush behavior works as expected? I feel as though the discussions going all the way to the initial review feedback on this change have provided constant reminders that things like the I/O flush / ordering dependencies are very easy to get wrong. If that is ever the case, it's not clear to me we have any way to detect it. Not that I'd expect to have some explicit 5 second fstest for that problem, but have we analyzed whether the few decent crash recovery tests we do have are sensitive enough to fail? I wonder if we could come up with something clever and more explicit via error injection.. > So, I think the right thing to do here is change this code to wait > on the ic_write_wait queue for completion of the previous iclog and > then we just don't care if commit_iclog->ic_prev is a past or future > iclog.... > > > > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL 2021-06-16 14:57 ` Brian Foster @ 2021-06-16 22:20 ` Dave Chinner 2021-06-17 13:15 ` Brian Foster 0 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2021-06-16 22:20 UTC (permalink / raw) To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs On Wed, Jun 16, 2021 at 10:57:29AM -0400, Brian Foster wrote: > On Wed, Jun 16, 2021 at 08:09:44AM +1000, Dave Chinner wrote: > > On Tue, Jun 15, 2021 at 11:38:14AM -0400, Brian Foster wrote: > > > On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote: > > So I went and looked at all the cases where we check > > ic_prev->ic_state to make decisions. There's only one place that > > does that: log forces. The sync transaction/log force aggregation > > code is the only thing that checks the state of the previous iclog, > > and it appears to avoid the "future iclog" problem in two ways. > > First, it only checks the previous iclog state when the current head > > iclog state is ACTIVE (i.e. no future iclogs). Secondly, even if it > > ic-prev is a future iclog, it only waits on the ic_write_wait queue, > > which means the log force gets woken when IO completion occurs and > > the iclog transitions to DONE_SYNC, hence it never waits for > > ordering of callbacks (other log force code does that). > > > > IOWs, there's only one place that actually tries to use the previous > > iclog state for making decisions, and it does doesn't really handle > > the case we need to handle here which is determining if the previous > > iclog is a past or future iclog. I does, however, handle the case of > > incorrect waiting safely. > > > > That is, what we need to wait for in the CIL push is the completion > > of the physical IO to be signalled, not for the callbacks associated > > with that iclog to run. IOWs, neat as it may be, > > xlog_wait_on_iclog() is not the right function to be using here. We > > should be waiting on the ic_write_wait queue to be woken when iclog > > IO completion occurs, not the ic_force_wait queue that waits on > > ordered completion of entire journal checkpoints. We just want to > > ensure that all previous iclogs are physically completed before we > > do the pre-flush on the commit record IO, we don't need software > > processing of those IOs to be complete. > > > > Ok, but doesn't that leave a gap in scenarios where iclog completion > occurs out of order? I thought the reason for waiting on the force of > the previous iclog was to encapsulate that the entire content of the > checkpoint has been written to disk, not just the content in the > immediately previous iclog. Yes. We either have to wait for all the iclogs to be written or we have to use ic_force_wait for sequencing. > If the issue is past/future tense, is there any reason we can't filter > against the record header lsn of the previous iclog? E.g., assuming the > iclog hasn't been cleaned/reactivated (i.e. h_lsn != 0), then only block > on the previous iclog if XFS_LSN_CMP(prev_lsn, commit_lsn) <= 0? That's what the patch I left running overnight does: /* * If the checkpoint spans multiple iclogs, wait for all previous iclogs * to complete before we submit the commit_iclog. We can't use state * checks for this - ACTIVE can be either a past completed iclog or a * future iclog being filled, while WANT_SYNC through SYNC_DONE can be a * past or future iclog awaiting IO or ordered IO completion to be run. * In the latter case, if it's a future iclog and we wait on it, the we * will hang because it won't get processed through to ic_force_wait * wakeup until this commit_iclog is written to disk. Hence we use the * iclog header lsn and compare it to the commit lsn to determine if we * need to wait on iclogs or not. */ spin_lock(&log->l_icloglock); if (ctx->start_lsn != ctx->commit_lsn) { struct xlog_in_core *iclog; for (iclog = commit_iclog->ic_prev; iclog != commit_iclog; iclog = iclog->ic_prev) { xfs_lsn_t hlsn; /* * If the LSN of the iclog is zero or in the future it * means it has passed through IO completion and * activation and hence all previous iclogs have also * done so. We do not need to wait at all in this case. */ hlsn = be64_to_cpu(iclog->ic_header.h_lsn); if (!hlsn || XFS_LSN_CMP(hlsn, ctx->commit_lsn) > 0) break; /* * If the LSN of the iclog is older than the commit lsn, * we have to wait on it. Waiting on this via the * ic_force_wait should also order the completion of all * older iclogs, too, but we leave checking that to the * next loop iteration. */ ASSERT(XFS_LSN_CMP(hlsn, ctx->commit_lsn) < 0); xlog_wait_on_iclog(iclog); spin_lock(&log->l_icloglock); } /* * Regardless of whether we need to wait or not, the the * commit_iclog write needs to issue a pre-flush so that the * ordering for this checkpoint is correctly preserved down to * stable storage. */ commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH; } > Hmm.. that also has me wondering if it's possible that the previous > iclog might be some other future checkpoint, but then skipping the wait > entirely means the checkpoint might not actually be on-disk [1]. I If the previous iclog is a future iclog, then all iclogs are future iclogs and this iclog is the oldest. The iclog ring only ages in one direction.... > wonder if we need to do something like track the last lsn and/or iclog > touched by the current checkpoint and make sure we wait on that, > regardless of where it is in the ring (but as above, I'm still wrapping > my head around all of this). We have to guarantee the log is completely stable up to the commit record, regardless of which checkpoint the iclogs belong to. If we don't do that, then log recovery will consider that the checkpoint in the log is not complete before it finds the commit record. i.e. there will be a hole in the log where the previous cycle numbers show through between the end of the checkpoint and the commit record, and hence the head of the log will be pointed at the last fully completed checkpoint, not this one. Hence we have to ensure all prior iclogs have completed to stable storage before writing the commit record, regardless of whether the previous iclogs are part of the current checkpoint or not. > [1] Once we work out the functional behavior one way or another, how do > we actually test/confirm that the flush behavior works as expected? I > feel as though the discussions going all the way to the initial review > feedback on this change have provided constant reminders that things > like the I/O flush / ordering dependencies are very easy to get wrong. > If that is ever the case, it's not clear to me we have any way to detect > it. Not that I'd expect to have some explicit 5 second fstest for that > problem, but have we analyzed whether the few decent crash recovery > tests we do have are sensitive enough to fail? I wonder if we could come > up with something clever and more explicit via error injection.. Given that we've found these problems soon after the code was integrated and made available for wider testing, I think it shows that, at the integration level, we have sufficiently sensitive crash and crash recovery tests to discover issues like this. It's not surprising that we may not find things like this at the individual developer level because they only have so much time and resources at their disposal. IOWs, I think the process is working exactly the way it should be working. It is expected that complex, subtle problems like these will be found (if they exist) once more people are actively using and testing the code in a variety of different environments. THis is exactly the function that integration testing is supposed to provide... So while it's possible that we could have more reliable test to exercise this stuff (no real idea how, though), the way we are developing and merging and finding much harder to trigger problems once the code is available for wider testing shows that our engineering process is working as it should be.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL 2021-06-16 22:20 ` Dave Chinner @ 2021-06-17 13:15 ` Brian Foster 0 siblings, 0 replies; 14+ messages in thread From: Brian Foster @ 2021-06-17 13:15 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs On Thu, Jun 17, 2021 at 08:20:54AM +1000, Dave Chinner wrote: > On Wed, Jun 16, 2021 at 10:57:29AM -0400, Brian Foster wrote: > > On Wed, Jun 16, 2021 at 08:09:44AM +1000, Dave Chinner wrote: > > > On Tue, Jun 15, 2021 at 11:38:14AM -0400, Brian Foster wrote: > > > > On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote: > > > So I went and looked at all the cases where we check > > > ic_prev->ic_state to make decisions. There's only one place that > > > does that: log forces. The sync transaction/log force aggregation > > > code is the only thing that checks the state of the previous iclog, > > > and it appears to avoid the "future iclog" problem in two ways. > > > First, it only checks the previous iclog state when the current head > > > iclog state is ACTIVE (i.e. no future iclogs). Secondly, even if it > > > ic-prev is a future iclog, it only waits on the ic_write_wait queue, > > > which means the log force gets woken when IO completion occurs and > > > the iclog transitions to DONE_SYNC, hence it never waits for > > > ordering of callbacks (other log force code does that). > > > > > > IOWs, there's only one place that actually tries to use the previous > > > iclog state for making decisions, and it does doesn't really handle > > > the case we need to handle here which is determining if the previous > > > iclog is a past or future iclog. I does, however, handle the case of > > > incorrect waiting safely. > > > > > > That is, what we need to wait for in the CIL push is the completion > > > of the physical IO to be signalled, not for the callbacks associated > > > with that iclog to run. IOWs, neat as it may be, > > > xlog_wait_on_iclog() is not the right function to be using here. We > > > should be waiting on the ic_write_wait queue to be woken when iclog > > > IO completion occurs, not the ic_force_wait queue that waits on > > > ordered completion of entire journal checkpoints. We just want to > > > ensure that all previous iclogs are physically completed before we > > > do the pre-flush on the commit record IO, we don't need software > > > processing of those IOs to be complete. > > > > > > > Ok, but doesn't that leave a gap in scenarios where iclog completion > > occurs out of order? I thought the reason for waiting on the force of > > the previous iclog was to encapsulate that the entire content of the > > checkpoint has been written to disk, not just the content in the > > immediately previous iclog. > > Yes. We either have to wait for all the iclogs to be written or > we have to use ic_force_wait for sequencing. > Right, Ok. That wasn't stated clearly in your earlier comment. > > If the issue is past/future tense, is there any reason we can't filter > > against the record header lsn of the previous iclog? E.g., assuming the > > iclog hasn't been cleaned/reactivated (i.e. h_lsn != 0), then only block > > on the previous iclog if XFS_LSN_CMP(prev_lsn, commit_lsn) <= 0? > > That's what the patch I left running overnight does: > Ok. > /* > * If the checkpoint spans multiple iclogs, wait for all previous iclogs > * to complete before we submit the commit_iclog. We can't use state > * checks for this - ACTIVE can be either a past completed iclog or a > * future iclog being filled, while WANT_SYNC through SYNC_DONE can be a > * past or future iclog awaiting IO or ordered IO completion to be run. > * In the latter case, if it's a future iclog and we wait on it, the we > * will hang because it won't get processed through to ic_force_wait > * wakeup until this commit_iclog is written to disk. Hence we use the > * iclog header lsn and compare it to the commit lsn to determine if we > * need to wait on iclogs or not. > */ > spin_lock(&log->l_icloglock); > if (ctx->start_lsn != ctx->commit_lsn) { > struct xlog_in_core *iclog; > > for (iclog = commit_iclog->ic_prev; > iclog != commit_iclog; > iclog = iclog->ic_prev) { > xfs_lsn_t hlsn; > > /* > * If the LSN of the iclog is zero or in the future it > * means it has passed through IO completion and > * activation and hence all previous iclogs have also > * done so. We do not need to wait at all in this case. > */ > hlsn = be64_to_cpu(iclog->ic_header.h_lsn); > if (!hlsn || XFS_LSN_CMP(hlsn, ctx->commit_lsn) > 0) > break; > > /* > * If the LSN of the iclog is older than the commit lsn, > * we have to wait on it. Waiting on this via the > * ic_force_wait should also order the completion of all > * older iclogs, too, but we leave checking that to the > * next loop iteration. > */ What is the point of the loop? ISTM the functional change only requires some minimal tweaking from the first patch you posted. > ASSERT(XFS_LSN_CMP(hlsn, ctx->commit_lsn) < 0); > xlog_wait_on_iclog(iclog); > spin_lock(&log->l_icloglock); > } > > /* > * Regardless of whether we need to wait or not, the the > * commit_iclog write needs to issue a pre-flush so that the > * ordering for this checkpoint is correctly preserved down to > * stable storage. > */ > commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH; > } > > > > Hmm.. that also has me wondering if it's possible that the previous > > iclog might be some other future checkpoint, but then skipping the wait > > entirely means the checkpoint might not actually be on-disk [1]. I > > If the previous iclog is a future iclog, then all iclogs are future > iclogs and this iclog is the oldest. The iclog ring only ages in one > direction.... > Hm, yes that makes sense. I might have been confusing the new push concurrency with the fact that all log writers still must contend over the head iclog. > > wonder if we need to do something like track the last lsn and/or iclog > > touched by the current checkpoint and make sure we wait on that, > > regardless of where it is in the ring (but as above, I'm still wrapping > > my head around all of this). > > We have to guarantee the log is completely stable up to the commit > record, regardless of which checkpoint the iclogs belong to. If we > don't do that, then log recovery will consider that the checkpoint > in the log is not complete before it finds the commit record. i.e. > there will be a hole in the log where the previous cycle numbers > show through between the end of the checkpoint and the commit > record, and hence the head of the log will be pointed at the last > fully completed checkpoint, not this one. > Yet another thus far undocumented side effect of concurrent pushes.. Yes, I know that recovery has always stamped over incomplete record blocks out beyond the prospective head block. I'm referring to the fact that recovery clearly has some historical warts that conflict with this change in behavior to the runtime logging code and suggesting that the upstream merge cycle should not be the avenue to identify and resolve these problems. It's rather evident to me at this point that this change should be deferred, tested much more thoroughly (not because it wasn't tested or anything of that nature, but because our log/crash recovery testing is historically lacking and crude) and reposted with these issues clearly documented and addressed, at the very least in a manner that allows for proper consideration of potential alternatives, etc., as opposed to racing with an arbitrary upstream release deadline. (This is of course out of my hands as it's not my tree, not my call, etc. Just my opinion. I've thrown your latest series into my testbed regardless.) > Hence we have to ensure all prior iclogs have completed to stable > storage before writing the commit record, regardless of whether the > previous iclogs are part of the current checkpoint or not. > > > [1] Once we work out the functional behavior one way or another, how do > > we actually test/confirm that the flush behavior works as expected? I > > feel as though the discussions going all the way to the initial review > > feedback on this change have provided constant reminders that things > > like the I/O flush / ordering dependencies are very easy to get wrong. > > If that is ever the case, it's not clear to me we have any way to detect > > it. Not that I'd expect to have some explicit 5 second fstest for that > > problem, but have we analyzed whether the few decent crash recovery > > tests we do have are sensitive enough to fail? I wonder if we could come > > up with something clever and more explicit via error injection.. > > Given that we've found these problems soon after the code was > integrated and made available for wider testing, I think it shows > that, at the integration level, we have sufficiently sensitive crash > and crash recovery tests to discover issues like this. It's not > surprising that we may not find things like this at the > individual developer level because they only have so much time and > resources at their disposal. > What do the bugs that were found have to do with test coverage for the updated flush semantics? I called that out separately precisely because recent patches demonstrate how the logic is subtle and easy to break, but I don't see any real coverage for it at all. So it would be nice to think about how we might improve on that. > IOWs, I think the process is working exactly the way it should be > working. It is expected that complex, subtle problems like these > will be found (if they exist) once more people are actively using > and testing the code in a variety of different environments. THis is > exactly the function that integration testing is supposed to > provide... > > So while it's possible that we could have more reliable test to > exercise this stuff (no real idea how, though), the way we are > developing and merging and finding much harder to trigger problems > once the code is available for wider testing shows that our > engineering process is working as it should be.... > Yeah, that's wishful thinking. I made a point to run some tests I've recently been hammering on (for that bli patch I still can't seem to progress) because 1.) I know we have a bunch of logging changes in the pipeline that I otherwise haven't been able to keep up with and 2.) I'm about to disappear for a few months on leave at any moment and so wanted to prioritize some testing to at least help verify we maintain correctness or otherwise get any issues reported to the list. And TBH if it weren't for 2., this is probably not something I would have prioritized over some other longer term development work in the first place. I'd like to hope that we would have caught these issues before release regardless, but let's be realistic.. if developer testing doesn't catch it, upstream review/merge cycle testing doesn't catch it, then for crash and recovery issues (no regular users are experimenting with failure and recovery of their filesystems to see if we've found new ways to lose their data), the odds go way down from there that somebody is going to catch it before it's released to the wild. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL 2021-06-15 6:46 ` [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner 2021-06-15 15:38 ` Brian Foster @ 2021-06-15 16:19 ` Darrick J. Wong 1 sibling, 0 replies; 14+ messages in thread From: Darrick J. Wong @ 2021-06-15 16:19 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > The iclogbuf ring attached to the struct xlog is circular, hence the > first and last iclogs in the ring can only be determined by > comparing them against the log->l_iclog pointer. > > In xfs_cil_push_work(), we want to wait on previous iclogs that were > issued so that we can flush them to stable storage with the commit > record write, and it simply waits on the previous iclog in the ring. > This, however, leads to CIL push hangs in generic/019 like so: > > task:kworker/u33:0 state:D stack:12680 pid: 7 ppid: 2 flags:0x00004000 > Workqueue: xfs-cil/pmem1 xlog_cil_push_work > Call Trace: > __schedule+0x30b/0x9f0 > schedule+0x68/0xe0 > xlog_wait_on_iclog+0x121/0x190 > ? wake_up_q+0xa0/0xa0 > xlog_cil_push_work+0x994/0xa10 > ? _raw_spin_lock+0x15/0x20 > ? xfs_swap_extents+0x920/0x920 > process_one_work+0x1ab/0x390 > worker_thread+0x56/0x3d0 > ? rescuer_thread+0x3c0/0x3c0 > kthread+0x14d/0x170 > ? __kthread_bind_mask+0x70/0x70 > ret_from_fork+0x1f/0x30 > > With other threads blocking in either xlog_state_get_iclog_space() > waiting for iclog space or xlog_grant_head_wait() waiting for log > reservation space. > > The problem here is that the previous iclog on the ring might > actually be a future iclog. That is, if log->l_iclog points at > commit_iclog, commit_iclog is the first (oldest) iclog in the ring > and there are no previous iclogs pending as they have all completed > their IO and been activated again. IOWs, commit_iclog->ic_prev > points to an iclog that will be written in the future, not one that > has been written in the past. > > Hence, in this case, waiting on the ->ic_prev iclog is incorrect > behaviour, and depending on the state of the future iclog, we can > end up with a circular ABA wait cycle and we hang. > > Fix this by only waiting on the previous iclog when the commit_iclog > is not the oldest iclog in the ring. > > Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO") > Reported-by: Brian Foster <bfoster@redhat.com> > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log_cil.c | 18 ++++++++++++------ > 1 file changed, 12 insertions(+), 6 deletions(-) > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index 705619e9dab4..398f00cf9cbf 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -1075,15 +1075,21 @@ xlog_cil_push_work( > ticket = ctx->ticket; > > /* > - * If the checkpoint spans multiple iclogs, wait for all previous > - * iclogs to complete before we submit the commit_iclog. In this case, > - * the commit_iclog write needs to issue a pre-flush so that the > - * ordering is correctly preserved down to stable storage. > + * If the checkpoint spans multiple iclogs, wait for all previous iclogs > + * to complete before we submit the commit_iclog. If the commit iclog is > + * at the head of the iclog ring, then all other iclogs have completed > + * and are waiting on this one and hence we don't need to wait. > + * > + * Regardless of whether we need to wait or not, the the commit_iclog > + * write needs to issue a pre-flush so that the ordering for this > + * checkpoint is correctly preserved down to stable storage. > */ > spin_lock(&log->l_icloglock); > if (ctx->start_lsn != commit_lsn) { > - xlog_wait_on_iclog(commit_iclog->ic_prev); > - spin_lock(&log->l_icloglock); > + if (commit_iclog != log->l_iclog) { > + xlog_wait_on_iclog(commit_iclog->ic_prev); > + spin_lock(&log->l_icloglock); > + } I'm confused. How can you tell that we need to wait for commit_iclog->ic_prev to be written out by comparing commit_iclog to log->l_iclog? Can't you determine this by checking ic_prev for DIRTY state? --D > commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH; > } > > -- > 2.31.1 > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: xfs: fix CIL push hang in for-next tree 2021-06-15 6:46 xfs: fix CIL push hang in for-next tree Dave Chinner 2021-06-15 6:46 ` [PATCH 1/2] xfs: add iclog state trace events Dave Chinner 2021-06-15 6:46 ` [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner @ 2021-06-15 17:57 ` Darrick J. Wong 2 siblings, 0 replies; 14+ messages in thread From: Darrick J. Wong @ 2021-06-15 17:57 UTC (permalink / raw) To: Dave Chinner, Allison Henderson; +Cc: linux-xfs On Tue, Jun 15, 2021 at 04:46:56PM +1000, Dave Chinner wrote: > Hi folks, > > This is the first fix for the problems Brian has reported from > generic/019. This has fixed the hang, but the other log recovery > problem he reported is still present (seen once with these patches > in place). > > I've tested these out to a couple of hundred cycles of > continual looping generic/019 before the systems fall over with a > perag reference count underrun at unmount after a shutdown. I'm > pretty sure the hang is fixed, as it would manifest within 10-20 > cycles without this patch. > > The first patch is the iclogbuf state tracing I used to capture the > iclogbuf wrapping state. The second patch is the fix. I found another bug while testing for-next. If I run generic/100 more than about ~30 times with a 1k block size: FSTYP -- xfs (debug) PLATFORM -- Linux/x86_64 flax-mtr00 5.13.0-rc4-djwx #rc4 SMP PREEMPT Mon Jun 7 11:17:23 PDT 2021 MKFS_OPTIONS -- -f -b size=1024, /dev/sdf MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdf /opt I see this in dmesg: run fstests generic/100 at 2021-06-15 10:41:45 XFS (sda): ctx ticket reservation ran out. Need to up reservation XFS (sda): ticket reservation summary: XFS (sda): unit res = 47168 bytes XFS (sda): current res = -404 bytes XFS (sda): original count = 1 XFS (sda): remaining count = 1 XFS (sda): xfs_do_force_shutdown(0x2) called from line 2440 of file fs/xfs/xfs_log.c. Return address = xlog_write+0x608/0x640 [xfs] XFS (sda): Log I/O Error Detected. Shutting down filesystem XFS (sda): Please unmount the filesystem and rectify the problem(s) XFS (sda): Unmounting Filesystem Looking up that line in gdb produces: 0xffffffffa038a0a8 is in xlog_write (fs/xfs/xfs_log.c:2439). 2434 int log_offset; 2435 2436 if (ticket->t_curr_res < 0) { 2437 xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES, 2438 "ctx ticket reservation ran out. Need to up reservation"); 2439 xlog_print_tic_res(log->l_mp, ticket); 2440 xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR); 2441 } I haven't applied these two patches yet, but looking back through fstests reports I never saw this before the recent for-next push. I'm uncertain if it's the CIL work or the xattr refactoring that did this, though AFAICT generic/100 itself does not generate any xattrs and I don't have any LSMs enabled that would cause them to be created. --D > > Cheers, > > Dave. > > ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2021-06-17 13:15 UTC | newest] Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-06-15 6:46 xfs: fix CIL push hang in for-next tree Dave Chinner 2021-06-15 6:46 ` [PATCH 1/2] xfs: add iclog state trace events Dave Chinner 2021-06-15 15:26 ` Darrick J. Wong 2021-06-15 15:37 ` Brian Foster 2021-06-15 22:47 ` Dave Chinner 2021-06-16 14:55 ` Brian Foster 2021-06-15 6:46 ` [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner 2021-06-15 15:38 ` Brian Foster 2021-06-15 22:09 ` Dave Chinner 2021-06-16 14:57 ` Brian Foster 2021-06-16 22:20 ` Dave Chinner 2021-06-17 13:15 ` Brian Foster 2021-06-15 16:19 ` Darrick J. Wong 2021-06-15 17:57 ` xfs: fix CIL push hang in for-next tree Darrick J. Wong
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.