* [PATCH v2 0/2] xfs: limit CIL push sizes @ 2019-09-30 6:03 Dave Chinner 2019-09-30 6:03 ` [PATCH 1/2] xfs: Lower CIL flush limit for large logs Dave Chinner 2019-09-30 6:03 ` [PATCH 2/2] xfs: Throttle commits on delayed background CIL push Dave Chinner 0 siblings, 2 replies; 23+ messages in thread From: Dave Chinner @ 2019-09-30 6:03 UTC (permalink / raw) To: linux-xfs Hi Folks, Version 2 of the CIL push size limiting patches. The main changes in this version are updates to comments describing behaviour, making it clear this isn't a hard limit but a method of providing schedule points that will allow the CIL push to proceed rather than being held off indefinitely by ongoing work. The original patchset was here: https://lore.kernel.org/linux-xfs/20190909015159.19662-1-david@fromorbit.com/ Cheers, Dave. ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 1/2] xfs: Lower CIL flush limit for large logs 2019-09-30 6:03 [PATCH v2 0/2] xfs: limit CIL push sizes Dave Chinner @ 2019-09-30 6:03 ` Dave Chinner 2019-09-30 16:55 ` Brian Foster 2019-09-30 6:03 ` [PATCH 2/2] xfs: Throttle commits on delayed background CIL push Dave Chinner 1 sibling, 1 reply; 23+ messages in thread From: Dave Chinner @ 2019-09-30 6:03 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> The current CIL size aggregation limit is 1/8th the log size. This means for large logs we might be aggregating at least 250MB of dirty objects in memory before the CIL is flushed to the journal. With CIL shadow buffers sitting around, this means the CIL is often consuming >500MB of temporary memory that is all allocated under GFP_NOFS conditions. Flushing the CIL can take some time to do if there is other IO ongoing, and can introduce substantial log force latency by itself. It also pins the memory until the objects are in the AIL and can be written back and reclaimed by shrinkers. Hence this threshold also tends to determine the minimum amount of memory XFS can operate in under heavy modification without triggering the OOM killer. Modify the CIL space limit to prevent such huge amounts of pinned metadata from aggregating. We can have 2MB of log IO in flight at once, so limit aggregation to 16x this size. This threshold was chosen as it has little impact on performance (on 16-way fsmark) or log traffic but pins a lot less memory on large logs especially under heavy memory pressure. An aggregation limit of 8x had 5-10% performance degradation and a 50% increase in log throughput for the same workload, so clearly that was too small for highly concurrent workloads on large logs. This was found via trace analysis of AIL behaviour. e.g.
insertion from a single CIL flush: xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL $ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l 1721823 $ So there were 1.7 million objects inserted into the AIL from this CIL checkpoint, the first at 2323.392108, the last at 2325.667566 which was the end of the trace (i.e. it hadn't finished). Clearly a major problem. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_log_priv.h | 29 +++++++++++++++++++++++------ 1 file changed, 23 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index b880c23cb6e4..a3cc8a9a16d9 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -323,13 +323,30 @@ struct xfs_cil { * tries to keep 25% of the log free, so we need to keep below that limit or we * risk running out of free log space to start any new transactions. * - * In order to keep background CIL push efficient, we will set a lower - * threshold at which background pushing is attempted without blocking current - * transaction commits. A separate, higher bound defines when CIL pushes are - * enforced to ensure we stay within our maximum checkpoint size bounds. - * threshold, yet give us plenty of space for aggregation on large logs. + * In order to keep background CIL push efficient, we only need to ensure the + * CIL is large enough to maintain sufficient in-memory relogging to avoid + * repeated physical writes of frequently modified metadata. If we allow the CIL + * to grow to a substantial fraction of the log, then we may be pinning hundreds + * of megabytes of metadata in memory until the CIL flushes. This can cause + * issues when we are running low on memory - pinned memory cannot be reclaimed, + * and the CIL consumes a lot of memory. 
Hence we need to set an upper physical + * size limit for the CIL that limits the maximum amount of memory pinned by the + * CIL but does not limit performance by reducing relogging efficiency + * significantly. + * + * As such, the CIL push threshold ends up being the smaller of two thresholds: + * - a threshold large enough that it allows CIL to be pushed and progress to be + * made without excessive blocking of incoming transaction commits. This is + * defined to be 12.5% of the log space - half the 25% push threshold of the + * AIL. + * - small enough that it doesn't pin excessive amounts of memory but maintains + * close to peak relogging efficiency. This is defined to be 16x the iclog + * buffer window (32MB) as measurements have shown this to be roughly the + * point of diminishing performance increases under highly concurrent + * modification workloads. */ -#define XLOG_CIL_SPACE_LIMIT(log) (log->l_logsize >> 3) +#define XLOG_CIL_SPACE_LIMIT(log) \ + min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4) /* * ticket grant locks, queues and accounting have their own cachlines -- 2.23.0.rc1
* Re: [PATCH 1/2] xfs: Lower CIL flush limit for large logs 2019-09-30 6:03 ` [PATCH 1/2] xfs: Lower CIL flush limit for large logs Dave Chinner @ 2019-09-30 16:55 ` Brian Foster 0 siblings, 0 replies; 23+ messages in thread From: Brian Foster @ 2019-09-30 16:55 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Mon, Sep 30, 2019 at 04:03:43PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > The current CIL size aggregation limit is 1/8th the log size. This > means for large logs we might be aggregating at least 250MB of dirty objects > in memory before the CIL is flushed to the journal. With CIL shadow > buffers sitting around, this means the CIL is often consuming >500MB > of temporary memory that is all allocated under GFP_NOFS conditions. > > Flushing the CIL can take some time to do if there is other IO > ongoing, and can introduce substantial log force latency by itself. > It also pins the memory until the objects are in the AIL and can be > written back and reclaimed by shrinkers. Hence this threshold also > tends to determine the minimum amount of memory XFS can operate in > under heavy modification without triggering the OOM killer. > > Modify the CIL space limit to prevent such huge amounts of pinned > metadata from aggregating. We can have 2MB of log IO in flight at > once, so limit aggregation to 16x this size. This threshold was > chosen as it has little impact on performance (on 16-way fsmark) or log > traffic but pins a lot less memory on large logs especially under > heavy memory pressure. An aggregation limit of 8x had 5-10% > performance degradation and a 50% increase in log throughput for > the same workload, so clearly that was too small for highly > concurrent workloads on large logs. > > This was found via trace analysis of AIL behaviour. e.g.
insertion > from a single CIL flush: > > xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL > > $ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l > 1721823 > $ > > So there were 1.7 million objects inserted into the AIL from this > CIL checkpoint, the first at 2323.392108, the last at 2325.667566 which > was the end of the trace (i.e. it hadn't finished). Clearly a major > problem. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- Seems reasonable: Reviewed-by: Brian Foster <bfoster@redhat.com> > fs/xfs/xfs_log_priv.h | 29 +++++++++++++++++++++++------ > 1 file changed, 23 insertions(+), 6 deletions(-) > > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index b880c23cb6e4..a3cc8a9a16d9 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -323,13 +323,30 @@ struct xfs_cil { > * tries to keep 25% of the log free, so we need to keep below that limit or we > * risk running out of free log space to start any new transactions. > * > - * In order to keep background CIL push efficient, we will set a lower > - * threshold at which background pushing is attempted without blocking current > - * transaction commits. A separate, higher bound defines when CIL pushes are > - * enforced to ensure we stay within our maximum checkpoint size bounds. > - * threshold, yet give us plenty of space for aggregation on large logs. > + * In order to keep background CIL push efficient, we only need to ensure the > + * CIL is large enough to maintain sufficient in-memory relogging to avoid > + * repeated physical writes of frequently modified metadata. If we allow the CIL > + * to grow to a substantial fraction of the log, then we may be pinning hundreds > + * of megabytes of metadata in memory until the CIL flushes. This can cause > + * issues when we are running low on memory - pinned memory cannot be reclaimed, > + * and the CIL consumes a lot of memory. 
Hence we need to set an upper physical > + * size limit for the CIL that limits the maximum amount of memory pinned by the > + * CIL but does not limit performance by reducing relogging efficiency > + * significantly. > + * > + * As such, the CIL push threshold ends up being the smaller of two thresholds: > + * - a threshold large enough that it allows CIL to be pushed and progress to be > + * made without excessive blocking of incoming transaction commits. This is > + * defined to be 12.5% of the log space - half the 25% push threshold of the > + * AIL. > + * - small enough that it doesn't pin excessive amounts of memory but maintains > + * close to peak relogging efficiency. This is defined to be 16x the iclog > + * buffer window (32MB) as measurements have shown this to be roughly the > + * point of diminishing performance increases under highly concurrent > + * modification workloads. > */ > -#define XLOG_CIL_SPACE_LIMIT(log) (log->l_logsize >> 3) > +#define XLOG_CIL_SPACE_LIMIT(log) \ > + min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4) > > /* > * ticket grant locks, queues and accounting have their own cachlines > -- > 2.23.0.rc1 >
* [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-09-30 6:03 [PATCH v2 0/2] xfs: limit CIL push sizes Dave Chinner 2019-09-30 6:03 ` [PATCH 1/2] xfs: Lower CIL flush limit for large logs Dave Chinner @ 2019-09-30 6:03 ` Dave Chinner 2019-09-30 17:03 ` Brian Foster 1 sibling, 1 reply; 23+ messages in thread From: Dave Chinner @ 2019-09-30 6:03 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> In certain situations the background CIL push can be indefinitely delayed. While we have workarounds for the obvious cases now, they don't solve the underlying issue. The issue is that there is no upper limit on the CIL where we will either force or wait for a background push to start, hence allowing the CIL to grow without bound until it consumes all log space. To fix this, add a new wait queue to the CIL which allows background pushes to wait for the CIL context to be switched out. This happens when the push starts, so it will allow us to block incoming transaction commit completion until the push has started. This will only affect processes that are running modifications, and only when the CIL threshold has been significantly overrun. This has no apparent impact on performance, and did not even trigger until over 45 million inodes had been created in a 16-way fsmark test on a 2GB log. That was limiting at 64MB of log space used, so the active CIL size is only about 3% of the total log in that case. The concurrent removal of those files did not trigger the background sleep at all.
Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_log_cil.c | 37 +++++++++++++++++++++++++++++++++---- fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++ fs/xfs/xfs_trace.h | 1 + 3 files changed, 58 insertions(+), 4 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index ef652abd112c..4a09d50e1368 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -670,6 +670,11 @@ xlog_cil_push( push_seq = cil->xc_push_seq; ASSERT(push_seq <= ctx->sequence); + /* + * Wake up any background push waiters now this context is being pushed. + */ + wake_up_all(&ctx->push_wait); + /* * Check if we've anything to push. If there is nothing, then we don't * move on to a new sequence number and so we have to be able to push @@ -746,6 +751,7 @@ xlog_cil_push( */ INIT_LIST_HEAD(&new_ctx->committing); INIT_LIST_HEAD(&new_ctx->busy_extents); + init_waitqueue_head(&new_ctx->push_wait); new_ctx->sequence = ctx->sequence + 1; new_ctx->cil = cil; cil->xc_ctx = new_ctx; @@ -900,7 +906,7 @@ xlog_cil_push_work( */ static void xlog_cil_push_background( - struct xlog *log) + struct xlog *log) __releases(cil->xc_ctx_lock) { struct xfs_cil *cil = log->l_cilp; @@ -914,14 +920,36 @@ xlog_cil_push_background( * don't do a background push if we haven't used up all the * space available yet. */ - if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) + if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) { + up_read(&cil->xc_ctx_lock); return; + } spin_lock(&cil->xc_push_lock); if (cil->xc_push_seq < cil->xc_current_sequence) { cil->xc_push_seq = cil->xc_current_sequence; queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); } + + /* + * Drop the context lock now, we can't hold that if we need to sleep + * because we are over the blocking threshold. The push_lock is still + * held, so blocking threshold sleep/wakeup is still correctly + * serialised here. 
+ */ + up_read(&cil->xc_ctx_lock); + + /* + * If we are well over the space limit, throttle the work that is being + * done until the push work on this context has begun. + */ + if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { + trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket); + ASSERT(cil->xc_ctx->space_used < log->l_logsize); + xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock); + return; + } + spin_unlock(&cil->xc_push_lock); } @@ -1038,9 +1066,9 @@ xfs_log_commit_cil( if (lip->li_ops->iop_committing) lip->li_ops->iop_committing(lip, xc_commit_lsn); } - xlog_cil_push_background(log); - up_read(&cil->xc_ctx_lock); + /* xlog_cil_push_background() releases cil->xc_ctx_lock */ + xlog_cil_push_background(log); } /* @@ -1199,6 +1227,7 @@ xlog_cil_init( INIT_LIST_HEAD(&ctx->committing); INIT_LIST_HEAD(&ctx->busy_extents); + init_waitqueue_head(&ctx->push_wait); ctx->sequence = 1; ctx->cil = cil; cil->xc_ctx = ctx; diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index a3cc8a9a16d9..f231b7dfaeab 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -247,6 +247,7 @@ struct xfs_cil_ctx { struct xfs_log_vec *lv_chain; /* logvecs being pushed */ struct list_head iclog_entry; struct list_head committing; /* ctx committing list */ + wait_queue_head_t push_wait; /* background push throttle */ struct work_struct discard_endio_work; }; @@ -344,10 +345,33 @@ struct xfs_cil { * buffer window (32MB) as measurements have shown this to be roughly the * point of diminishing performance increases under highly concurrent * modification workloads. + * + * To prevent the CIL from overflowing upper commit size bounds, we introduce a + * new threshold at which we block committing transactions until the background + * CIL commit commences and switches to a new context. 
While this is not a hard + * limit, it forces the process committing a transaction to the CIL to block and + * yield the CPU, giving the CIL push work a chance to be scheduled and start + * work. This prevents a process running lots of transactions from overfilling + * the CIL because it is not yielding the CPU. We set the blocking limit at + * twice the background push space threshold so we keep in line with the AIL + * push thresholds. + * + * Note: this is not a -hard- limit as blocking is applied after the transaction + * is inserted into the CIL and the push has been triggered. It is largely a + * throttling mechanism that allows the CIL push to be scheduled and run. A hard + * limit will be difficult to implement without introducing global serialisation + * in the CIL commit fast path, and it's not at all clear that we actually need + * such hard limits given the ~7 years we've run without a hard limit before + * finding the first situation where a checkpoint size overflow actually + * occurred. Hence the simple throttle, and an ASSERT check to tell us that + * we've overrun the max size. */ #define XLOG_CIL_SPACE_LIMIT(log) \ min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4) +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log) \ + (XLOG_CIL_SPACE_LIMIT(log) * 2) + /* * ticket grant locks, queues and accounting have their own cachlines * as these are quite hot and can be operated on concurrently. diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index eaae275ed430..e7087ede2662 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -1011,6 +1011,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_sub); DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_enter); DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_exit); DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_sub); +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait); DECLARE_EVENT_CLASS(xfs_log_item_class, TP_PROTO(struct xfs_log_item *lip), -- 2.23.0.rc1
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-09-30 6:03 ` [PATCH 2/2] xfs: Throttle commits on delayed background CIL push Dave Chinner @ 2019-09-30 17:03 ` Brian Foster 2019-09-30 21:53 ` Dave Chinner 0 siblings, 1 reply; 23+ messages in thread From: Brian Foster @ 2019-09-30 17:03 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Mon, Sep 30, 2019 at 04:03:44PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > In certain situations the background CIL push can be indefinitely > delayed. While we have workarounds for the obvious cases now, they > don't solve the underlying issue. The issue is that there is no > upper limit on the CIL where we will either force or wait for > a background push to start, hence allowing the CIL to grow without > bound until it consumes all log space. > > To fix this, add a new wait queue to the CIL which allows background > pushes to wait for the CIL context to be switched out. This happens > when the push starts, so it will allow us to block incoming > transaction commit completion until the push has started. This will > only affect processes that are running modifications, and only when > the CIL threshold has been significantly overrun. > > This has no apparent impact on performance, and did not even trigger > until over 45 million inodes had been created in a 16-way fsmark > test on a 2GB log. That was limiting at 64MB of log space used, so > the active CIL size is only about 3% of the total log in that case. > The concurrent removal of those files did not trigger the background > sleep at all. > Have you done similar testing for small/minimum sized logs? Also, if this is so limited in occurrence, had you given any thought to something even simpler, like flushing the CIL push workqueue when over the throttle threshold? That would wait longer to fill the iclogs, but would eliminate the need for another waitqueue that's apparently only used for throttling.
It may also facilitate reuse of xlog_cil_push_now() in the >= XLOG_CIL_BLOCKING_SPACE_LIMIT() case (actually, I think this could facilitate the elimination of xlog_cil_push_background() entirely). That aside... > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log_cil.c | 37 +++++++++++++++++++++++++++++++++---- > fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++ > fs/xfs/xfs_trace.h | 1 + > 3 files changed, 58 insertions(+), 4 deletions(-) > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index ef652abd112c..4a09d50e1368 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -670,6 +670,11 @@ xlog_cil_push( > push_seq = cil->xc_push_seq; > ASSERT(push_seq <= ctx->sequence); > > + /* > + * Wake up any background push waiters now this context is being pushed. > + */ > + wake_up_all(&ctx->push_wait); > + > /* > * Check if we've anything to push. If there is nothing, then we don't > * move on to a new sequence number and so we have to be able to push > @@ -746,6 +751,7 @@ xlog_cil_push( > */ > INIT_LIST_HEAD(&new_ctx->committing); > INIT_LIST_HEAD(&new_ctx->busy_extents); > + init_waitqueue_head(&new_ctx->push_wait); > new_ctx->sequence = ctx->sequence + 1; > new_ctx->cil = cil; > cil->xc_ctx = new_ctx; > @@ -900,7 +906,7 @@ xlog_cil_push_work( > */ > static void > xlog_cil_push_background( > - struct xlog *log) > + struct xlog *log) __releases(cil->xc_ctx_lock) > { > struct xfs_cil *cil = log->l_cilp; > > @@ -914,14 +920,36 @@ xlog_cil_push_background( > * don't do a background push if we haven't used up all the > * space available yet.
> */ > - if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) > + if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) { > + up_read(&cil->xc_ctx_lock); > return; > + } > > spin_lock(&cil->xc_push_lock); > if (cil->xc_push_seq < cil->xc_current_sequence) { > cil->xc_push_seq = cil->xc_current_sequence; > queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); > } > + > + /* > + * Drop the context lock now, we can't hold that if we need to sleep > + * because we are over the blocking threshold. The push_lock is still > + * held, so blocking threshold sleep/wakeup is still correctly > + * serialised here. > + */ > + up_read(&cil->xc_ctx_lock); > + > + /* > + * If we are well over the space limit, throttle the work that is being > + * done until the push work on this context has begun. > + */ > + if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { Can we use consistent logic with the rest of the function? I.e. up_read(..); if (space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { spin_unlock(..); return; } ... xlog_wait(...); > + trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket); > + ASSERT(cil->xc_ctx->space_used < log->l_logsize); > + xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock); > + return; > + } > + Also, I find it slightly annoying that even with all of this locking quirkiness we still manage to read ->space_used unprotected (now twice). IMO, the simple thing would be to let xlog_cil_insert_items() return the size immediately after the current transaction inserts and pass that into this function as a parameter. 
> spin_unlock(&cil->xc_push_lock); > > } > @@ -1038,9 +1066,9 @@ xfs_log_commit_cil( > if (lip->li_ops->iop_committing) > lip->li_ops->iop_committing(lip, xc_commit_lsn); > } > - xlog_cil_push_background(log); > > - up_read(&cil->xc_ctx_lock); > + /* xlog_cil_push_background() releases cil->xc_ctx_lock */ > + xlog_cil_push_background(log); > } > > /* > @@ -1199,6 +1227,7 @@ xlog_cil_init( > > INIT_LIST_HEAD(&ctx->committing); > INIT_LIST_HEAD(&ctx->busy_extents); > + init_waitqueue_head(&ctx->push_wait); > ctx->sequence = 1; > ctx->cil = cil; > cil->xc_ctx = ctx; > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index a3cc8a9a16d9..f231b7dfaeab 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -247,6 +247,7 @@ struct xfs_cil_ctx { > struct xfs_log_vec *lv_chain; /* logvecs being pushed */ > struct list_head iclog_entry; > struct list_head committing; /* ctx committing list */ > + wait_queue_head_t push_wait; /* background push throttle */ > struct work_struct discard_endio_work; > }; > > @@ -344,10 +345,33 @@ struct xfs_cil { > * buffer window (32MB) as measurements have shown this to be roughly the > * point of diminishing performance increases under highly concurrent > * modification workloads. > + * > + * To prevent the CIL from overflowing upper commit size bounds, we introduce a > + * new threshold at which we block committing transactions until the background > + * CIL commit commences and switches to a new context. While this is not a hard > + * limit, it forces the process committing a transaction to the CIL to block and > + * yield the CPU, giving the CIL push work a chance to be scheduled and start > + * work. This prevents a process running lots of transactions from overfilling > + * the CIL because it is not yielding the CPU. We set the blocking limit at > + * twice the background push space threshold so we keep in line with the AIL > + * push thresholds.
> + * > + * Note: this is not a -hard- limit as blocking is applied after the transaction > + * is inserted into the CIL and the push has been triggered. It is largely a > + * throttling mechanism that allows the CIL push to be scheduled and run. A hard > + * limit will be difficult to implement without introducing global serialisation > + * in the CIL commit fast path, and it's not at all clear that we actually need > + * such hard limits given the ~7 years we've run without a hard limit before > + * finding the first situation where a checkpoint size overflow actually > + * occurred. Hence the simple throttle, and an ASSERT check to tell us that > + * we've overrun the max size. > */ I appreciate the extra documentation here, but I think most of the second paragraph is better commit log description material than something worth lengthening this already huge comment for. I'd suggest something like the following, but feel free to rework of course: * * Since the CIL push threshold only triggers a background push, a * second threshold triggers transaction commit blocking until the * background push initiates and switches the CIL context. The blocking * threshold is set to twice the background push threshold to keep in * line with AIL push thresholds. Note that this is not a CIL context * size limit. This is a throttle threshold to slow the growth of the * context and yield the CPU for a background push under overload * conditions. */ Brian > #define XLOG_CIL_SPACE_LIMIT(log) \ > min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4) > > +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log) \ > + (XLOG_CIL_SPACE_LIMIT(log) * 2) > + > /* > * ticket grant locks, queues and accounting have their own cachlines > * as these are quite hot and can be operated on concurrently. 
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index eaae275ed430..e7087ede2662 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -1011,6 +1011,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_sub); > DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_enter); > DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_exit); > DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_sub); > +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait); > > DECLARE_EVENT_CLASS(xfs_log_item_class, > TP_PROTO(struct xfs_log_item *lip), > -- > 2.23.0.rc1 >
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-09-30 17:03 ` Brian Foster @ 2019-09-30 21:53 ` Dave Chinner 2019-10-01 3:42 ` Dave Chinner 2019-10-01 13:13 ` Brian Foster 0 siblings, 2 replies; 23+ messages in thread From: Dave Chinner @ 2019-09-30 21:53 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote: > On Mon, Sep 30, 2019 at 04:03:44PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > In certain situations the background CIL push can be indefinitely > > delayed. While we have workarounds for the obvious cases now, they > > don't solve the underlying issue. The issue is that there is no > > upper limit on the CIL where we will either force or wait for > > a background push to start, hence allowing the CIL to grow without > > bound until it consumes all log space. > > > > To fix this, add a new wait queue to the CIL which allows background > > pushes to wait for the CIL context to be switched out. This happens > > when the push starts, so it will allow us to block incoming > > transaction commit completion until the push has started. This will > > only affect processes that are running modifications, and only when > > the CIL threshold has been significantly overrun. > > > > This has no apparent impact on performance, and did not even trigger > > until over 45 million inodes had been created in a 16-way fsmark > > test on a 2GB log. That was limiting at 64MB of log space used, so > > the active CIL size is only about 3% of the total log in that case. > > The concurrent removal of those files did not trigger the background > > sleep at all. > > > > Have you done similar testing for small/minimum sized logs? Yes. I've had the tracepoint active during xfstests runs on test filesystems using default log sizes on 5-15GB filesystems. The only test in all of xfstests that has triggered it is generic/017, and it only triggered once. e.g.
# trace-cmd start -e xfs_log_cil_wait <run xfstests> # trace-cmd show # tracer: nop # # entries-in-buffer/entries-written: 1/1 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | xfs_io-2158 [001] ...1 309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560 # And the timestamp matched the time that generic/017 was running. > Also, if this is so limited in occurrence, had you given any thought to > something even simpler, like flushing the CIL push workqueue when > over the throttle threshold? Yes, I've tried that - flush_workqueue() blocks /everything/ until all the queued work is complete. That means it waits for the CIL to flush to disk and write a commit record, and every modification in the filesystem is stopped dead in its tracks until the CIL push is complete. The problem is that flushing workqueues is a synchronous operation, and it can't wait for partial work completion. We only need to wait for the CIL context to be swapped out - this is done by the push code before it starts all the expensive iclog formatting and waiting for iclog space... > That would wait longer to fill the iclogs, > but would eliminate the need for another waitqueue that's apparently > only used for throttling. It may also facilitate reuse of > xlog_cil_push_now() in the >= XLOG_CIL_BLOCKING_SPACE_LIMIT() case > (actually, I think this could facilitate the elimination of > xlog_cil_push_background() entirely). xlog_cil_push_now() uses flush_work() to push any pending work before it queues up the CIL flush that the caller is about to wait for. i.e.
the callers of xlog_cil_push_now() must ensure that all CIL contexts are flushed for the purposes of a log force as they are going to wait for all pending CIL flushes to complete. If we've already pushed the CIL to the sequence that we are asking to push to, we still have to wait for that previous push to be done. This is what the flush_work() call in xlog_cil_push_now() achieves. > That aside... > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_log_cil.c | 37 +++++++++++++++++++++++++++++++++---- > > fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++ > > fs/xfs/xfs_trace.h | 1 + > > 3 files changed, 58 insertions(+), 4 deletions(-) > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > index ef652abd112c..4a09d50e1368 100644 > > --- a/fs/xfs/xfs_log_cil.c > > +++ b/fs/xfs/xfs_log_cil.c > > @@ -670,6 +670,11 @@ xlog_cil_push( > > push_seq = cil->xc_push_seq; > > ASSERT(push_seq <= ctx->sequence); > > > > + /* > > + * Wake up any background push waiters now this context is being pushed. > > + */ > > + wake_up_all(&ctx->push_wait); > > + > > /* > > * Check if we've anything to push. If there is nothing, then we don't > > * move on to a new sequence number and so we have to be able to push > > @@ -746,6 +751,7 @@ xlog_cil_push( > > */ > > INIT_LIST_HEAD(&new_ctx->committing); > > INIT_LIST_HEAD(&new_ctx->busy_extents); > > + init_waitqueue_head(&new_ctx->push_wait); > > new_ctx->sequence = ctx->sequence + 1; > > new_ctx->cil = cil; > > cil->xc_ctx = new_ctx; > > @@ -900,7 +906,7 @@ xlog_cil_push_work( > > */ > > static void > > xlog_cil_push_background( > > - struct xlog *log) > > + struct xlog *log) __releases(cil->xc_ctx_lock) > > { > > struct xfs_cil *cil = log->l_cilp; > > > > @@ -914,14 +920,36 @@ xlog_cil_push_background( > > * don't do a background push if we haven't used up all the > > * space available yet.
> > */ > > - if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) > > + if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) { > > + up_read(&cil->xc_ctx_lock); > > return; > > + } > > > > spin_lock(&cil->xc_push_lock); > > if (cil->xc_push_seq < cil->xc_current_sequence) { > > cil->xc_push_seq = cil->xc_current_sequence; > > queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); > > } > > + > > + /* > > + * Drop the context lock now, we can't hold that if we need to sleep > > + * because we are over the blocking threshold. The push_lock is still > > + * held, so blocking threshold sleep/wakeup is still correctly > > + * serialised here. > > + */ > > + up_read(&cil->xc_ctx_lock); > > + > > + /* > > + * If we are well over the space limit, throttle the work that is being > > + * done until the push work on this context has begun. > > + */ > > + if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { > > Can we use consistent logic with the rest of the function? I.e. > > up_read(..); > if (space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { > spin_unlock(..); > return; > } I did it this way because code inside the if() statement is typically pushed out of line by the compiler if it finishes with a return statement (i.e. the compiler makes an "unlikely" static prediction). Hence it's best to put the slow path code inside the if(), not the fast path code. I can change it, but then we're into ugly likely()/unlikely() macro territory and I much prefer to structure the code so things like that are completely unnecessary. > > ... > xlog_wait(...); > > > + trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket); > > + ASSERT(cil->xc_ctx->space_used < log->l_logsize); > > + xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock); > > + return; > > + } > > + Also, I find it slightly annoying that even with all of this locking quirkiness we still manage to read ->space_used unprotected (now twice). 
> IMO, the simple thing would be to let xlog_cil_insert_items() return the > size immediately after the current transaction inserts and pass that > into this function as a parameter. Yes, I noticed this and thought about it a bit, but it doesn't really matter if the read is racy. It never really has mattered because it's not a hard limit where accuracy is vitally important for correct functioning. Locking and/or atomics are really only necessary if we care about exact accuracy, but we care only about avoiding fast path locking overhead as much as possible here. I'll deal with that in a separate patchset when I have some spare time. > > + * To prevent the CIL from overflowing upper commit size bounds, we introduce a > > + * new threshold at which we block committing transactions until the background > > + * CIL commit commences and switches to a new context. While this is not a hard > > + * limit, it forces the process committing a transaction to the CIL to block and > > + * yield the CPU, giving the CIL push work a chance to be scheduled and start > > + * work. This prevents a process running lots of transactions from overfilling > > + * the CIL because it is not yielding the CPU. We set the blocking limit at > > + * twice the background push space threshold so we keep in line with the AIL > > + * push thresholds. > > + * > > + * Note: this is not a -hard- limit as blocking is applied after the transaction > > + * is inserted into the CIL and the push has been triggered. It is largely a > > + * throttling mechanism that allows the CIL push to be scheduled and run. A hard > > + * limit will be difficult to implement without introducing global serialisation > > + * in the CIL commit fast path, and it's not at all clear that we actually need > > + * such hard limits given the ~7 years we've run without a hard limit before > > + * finding the first situation where a checkpoint size overflow actually > > + * occurred. 
Hence the simple throttle, and an ASSERT check to tell us that > > + * we've overrun the max size. > > */ > > I appreciate the extra documentation here, but I think most of the > second paragraph is better commit log description material than > something worth lengthening this already huge comment for. I'd suggest > something like the following, but feel free to rework of course: Darrick asked me to include some of the information I had in the commit description directly in the comment, so I did exactly what he asked.... > * > * Since the CIL push threshold only triggers a background push, a > * second threshold triggers transaction commit blocking until the > * background push initiates and switches the CIL context. The blocking > * threshold is set to twice the background push threshold to keep in > * line with AIL push thresholds. Note that this is not a CIL context > * size limit. This is a throttle threshold to slow the growth of the > * context and yield the CPU for a background push under overload > * conditions. Sure, but that doesn't tell us *why* it is implemented as a throttle threshold instead of a hard limit. This text tells us what the code does, not why it was implemented that way. I'll let Darrick decide on this. Cheers, Dave. > */ > > Brian > > > #define XLOG_CIL_SPACE_LIMIT(log) \ > > min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4) > > > > +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log) \ > > + (XLOG_CIL_SPACE_LIMIT(log) * 2) > > + > > /* > > * ticket grant locks, queues and accounting have their own cachelines > > * as these are quite hot and can be operated on concurrently. 
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > index eaae275ed430..e7087ede2662 100644 > > --- a/fs/xfs/xfs_trace.h > > +++ b/fs/xfs/xfs_trace.h > > @@ -1011,6 +1011,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_sub); > > DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_enter); > > DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_exit); > > DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_sub); > > +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait); > > > > DECLARE_EVENT_CLASS(xfs_log_item_class, > > TP_PROTO(struct xfs_log_item *lip), > > -- > > 2.23.0.rc1 > > > -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
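The two-threshold scheme reviewed above - a background push trigger at XLOG_CIL_SPACE_LIMIT and a commit throttle at twice that - can be sketched as a small userspace model. This is illustrative only: the names loosely follow the patch (the real XLOG_CIL_SPACE_LIMIT is also capped at 16x the maximum iclog size, which is omitted here), and the enum values stand in for queue_work()/xlog_wait() actions rather than real kernel calls.

```c
#include <assert.h>

/* What a committing transaction does after inserting into the CIL,
 * keyed off how much context space is already consumed. */
enum cil_action {
	CIL_NO_ACTION,		/* under threshold: nothing to do */
	CIL_QUEUE_PUSH,		/* queue background push, don't block */
	CIL_QUEUE_AND_THROTTLE,	/* queue push, then sleep on push_wait */
};

/* Background push threshold: 1/8th of the log (simplified). */
static long cil_space_limit(long logsize)
{
	return logsize >> 3;
}

/* Blocking throttle: twice the push threshold, as in the patch. */
static long cil_blocking_limit(long logsize)
{
	return cil_space_limit(logsize) * 2;
}

enum cil_action cil_push_decision(long space_used, long logsize)
{
	if (space_used < cil_space_limit(logsize))
		return CIL_NO_ACTION;
	if (space_used >= cil_blocking_limit(logsize))
		return CIL_QUEUE_AND_THROTTLE;
	return CIL_QUEUE_PUSH;
}
```

Note that the throttle branch is deliberately not a hard limit: by the time a committer sees CIL_QUEUE_AND_THROTTLE its items are already in the CIL; it merely yields until the push work swaps the context out and wakes the waitqueue.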
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-09-30 21:53 ` Dave Chinner @ 2019-10-01 3:42 ` Dave Chinner 2019-10-01 13:13 ` Brian Foster 2019-10-01 13:13 ` Brian Foster 1 sibling, 1 reply; 23+ messages in thread From: Dave Chinner @ 2019-10-01 3:42 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote: > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote: > > On Mon, Sep 30, 2019 at 04:03:44PM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > In certain situations the background CIL push can be indefinitely > > > delayed. While we have workarounds from the obvious cases now, it > > > doesn't solve the underlying issue. This issue is that there is no > > > upper limit on the CIL where we will either force or wait for > > > a background push to start, hence allowing the CIL to grow without > > > bound until it consumes all log space. > > > > > > To fix this, add a new wait queue to the CIL which allows background > > > pushes to wait for the CIL context to be switched out. This happens > > > when the push starts, so it will allow us to block incoming > > > transaction commit completion until the push has started. This will > > > only affect processes that are running modifications, and only when > > > the CIL threshold has been significantly overrun. > > > > > > This has no apparent impact on performance, and doesn't even trigger > > > until over 45 million inodes had been created in a 16-way fsmark > > > test on a 2GB log. That was limiting at 64MB of log space used, so > > > the active CIL size is only about 3% of the total log in that case. > > > The concurrent removal of those files did not trigger the background > > > sleep at all. > > > > > > > Have you done similar testing for small/minimum sized logs? > > Yes. 
I've had the tracepoint active during xfstests runs on test > filesystems using default log sizes on 5-15GB filesystems. The only > test in all of xfstests that has triggered it is generic/017, and it > only triggered once. > > e.g. > > # trace-cmd start -e xfs_log_cil_wait > <run xfstests> > # trace-cmd show > # tracer: nop > # > # entries-in-buffer/entries-written: 1/1 #P:4 > # > # _-----=> irqs-off > # / _----=> need-resched > # | / _---=> hardirq/softirq > # || / _--=> preempt-depth > # ||| / delay > # TASK-PID CPU# |||| TIMESTAMP FUNCTION > # | | | |||| | | > xfs_io-2158 [001] ...1 309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560 > # > > And the timestamp matched the time that generic/017 was running. So I've run this on my typical 16-way fsmark workload with different size logs. It barely triggers on log sizes larger than 64MB, on 32MB logs I can see it capturing all 16 fsmark processes while waiting for the CIL context to switch. This will give you an idea of the log cycles the capture is occurring on, and the count of processes being captured: $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c 16 251 32 475 16 494 32 870 15 1132 15 1166 14 1221 1 1222 16 1223 7 1307 8 1308 16 1315 16 1738 16 1832 9 2167 7 2168 16 2200 16 2375 16 2383 16 2700 16 2797 16 2798 16 2892 $ So typically groups of captures are hundreds of log cycles apart (100 cycles x 32MB = ~3GB of log writes), then there will be a stutter where the CIL dispatch is delayed, and then everything continues on. These all show the log is always around the 75% full (AIL tail pushing threshold) but the reservation grant wait lists are always empty so we're not running out of reservation space here. 
If I make the log even smaller - 16MB - the log is always full, the AIL is always tail pushing, and there is a constant stream of log forces (30-40/s) because tail pushing is hitting pinned items several thousand times a second. IOWs, the frequency of the log forces means that CIL is almost never growing large enough to do a background push, let alone overrun the blocking threshold. Same trace for the same workload as above: $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c 16 1400 16 5284 16 5624 16 5778 16 6159 10 6477 $ So when we have lots of concurrency and modification, tiny logs appear to be less susceptible to CIL overruns than small logs because they are constantly tail pushing and issuing log forces that trigger flushes of the CIL before an overrun could occur. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
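For reference, the awk pipeline used in these replies just histograms field 23 of each trace line, which in the xfs_log_cil_wait format is the grant_reserve_cycle value. A self-contained sketch with fabricated trace lines (the field values are made up; only the field positions match the real format):

```shell
# Three fabricated xfs_log_cil_wait events: two captures on cycle 251,
# one on cycle 475. Only field 23 (grant_reserve_cycle) matters here.
trace_file=$(mktemp)
cat > "$trace_file" <<'EOF'
 fsmark-100 [000] ...1 1.0: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 1 t_unit_res 1 t_flags X reserveq empty writeq empty grant_reserve_cycle 251 grant_reserve_bytes 0
 fsmark-101 [001] ...1 1.1: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 1 t_unit_res 1 t_flags X reserveq empty writeq empty grant_reserve_cycle 251 grant_reserve_bytes 0
 fsmark-102 [002] ...1 1.2: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 1 t_unit_res 1 t_flags X reserveq empty writeq empty grant_reserve_cycle 475 grant_reserve_bytes 0
EOF
# count of captured processes per log cycle, as in the tables quoted above
awk '/^ /{print $23}' "$trace_file" | sort -n | uniq -c
rm -f "$trace_file"
```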
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-01 3:42 ` Dave Chinner @ 2019-10-01 13:13 ` Brian Foster 2019-10-01 23:14 ` Dave Chinner 0 siblings, 1 reply; 23+ messages in thread From: Brian Foster @ 2019-10-01 13:13 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote: > On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote: > > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote: > > > On Mon, Sep 30, 2019 at 04:03:44PM +1000, Dave Chinner wrote: > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > In certain situations the background CIL push can be indefinitely > > > > delayed. While we have workarounds from the obvious cases now, it > > > > doesn't solve the underlying issue. This issue is that there is no > > > > upper limit on the CIL where we will either force or wait for > > > > a background push to start, hence allowing the CIL to grow without > > > > bound until it consumes all log space. > > > > > > > > To fix this, add a new wait queue to the CIL which allows background > > > > pushes to wait for the CIL context to be switched out. This happens > > > > when the push starts, so it will allow us to block incoming > > > > transaction commit completion until the push has started. This will > > > > only affect processes that are running modifications, and only when > > > > the CIL threshold has been significantly overrun. > > > > > > > > This has no apparent impact on performance, and doesn't even trigger > > > > until over 45 million inodes had been created in a 16-way fsmark > > > > test on a 2GB log. That was limiting at 64MB of log space used, so > > > > the active CIL size is only about 3% of the total log in that case. > > > > The concurrent removal of those files did not trigger the background > > > > sleep at all. > > > > > > > > > > Have you done similar testing for small/minimum sized logs? > > > > Yes. 
I've had the tracepoint active during xfstests runs on test > > filesystems using default log sizes on 5-15GB filesystems. The only > > test in all of xfstests that has triggered it is generic/017, and it > > only triggered once. > > > > e.g. > > > > # trace-cmd start -e xfs_log_cil_wait > > <run xfstests> > > # trace-cmd show > > # tracer: nop > > # > > # entries-in-buffer/entries-written: 1/1 #P:4 > > # > > # _-----=> irqs-off > > # / _----=> need-resched > > # | / _---=> hardirq/softirq > > # || / _--=> preempt-depth > > # ||| / delay > > # TASK-PID CPU# |||| TIMESTAMP FUNCTION > > # | | | |||| | | > > xfs_io-2158 [001] ...1 309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560 > > # > > > > And the timestamp matched the time that generic/017 was running. > > SO I've run this on my typical 16-way fsmark workload with different > size logs. It barely triggers on log sizes larger than 64MB, on 32MB > logs I can see it capturing all 16 fsmark processes while waiting > for the CIL context to switch. This will give you an idea of the > log cycles the capture is occuring on, and the count of processes > being captured: > > $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c > 16 251 > 32 475 > 16 494 > 32 870 > 15 1132 > 15 1166 > 14 1221 > 1 1222 > 16 1223 > 7 1307 > 8 1308 > 16 1315 > 16 1738 > 16 1832 > 9 2167 > 7 2168 > 16 2200 > 16 2375 > 16 2383 > 16 2700 > 16 2797 > 16 2798 > 16 2892 > $ Thanks. I assume I'm looking at cycle numbers and event counts here? > > So typically groups of captures are hundreds of log cycles apart > (100 cycles x 32MB = ~3GB of log writes), then there will be a > stutter where the CIL dispatch is delayed, and then everything > continues on. 
These all show the log is always around the 75% full > (AIL tail pushing theshold) but the reservation grant wait lists are > always empty so we're not running out of reservation space here. > It's somewhat interesting that we manage to block every thread most of the time before the CIL push task starts. I wonder a bit if that pattern would hold for a system/workload with more CPUs (and if so, if there are any odd side effects of stalling and waking hundreds of tasks at the same time vs. our traditional queuing behavior). > If I make the log even smaller - 16MB - the log is always full, the > AIL is always tail pushing, and there is a constant stream of log > forces (30-40/s) because tail pushing is hitting pinned items > several thousand times a second. IOWs, the frequency of the log > forces means that CIL is almost never growing large enough to do a > background push, let alone overrun the blocking threshold. Same > trace for the same workload as above: > > $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c > 16 1400 > 16 5284 > 16 5624 > 16 5778 > 16 6159 > 10 6477 > $ > > So when we have lots of concurrency and modification, tiny logs > appear to be less susceptible to CIL overruns than small logs > because they are constantly tail pushing and issuing log forces that > trigger trigger flushes of the CIL before an overruns could occur. > Makes sense, thanks. Brian > Cheers, > > Dave. > > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-01 13:13 ` Brian Foster @ 2019-10-01 23:14 ` Dave Chinner 2019-10-02 12:41 ` Brian Foster 0 siblings, 1 reply; 23+ messages in thread From: Dave Chinner @ 2019-10-01 23:14 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Tue, Oct 01, 2019 at 09:13:36AM -0400, Brian Foster wrote: > On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote: > > On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote: > > > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote: > > > > Have you done similar testing for small/minimum sized logs? > > > > > > Yes. I've had the tracepoint active during xfstests runs on test > > > filesystems using default log sizes on 5-15GB filesystems. The only > > > test in all of xfstests that has triggered it is generic/017, and it > > > only triggered once. > > > > > > e.g. > > > > > > # trace-cmd start -e xfs_log_cil_wait > > > <run xfstests> > > > # trace-cmd show > > > # tracer: nop > > > # > > > # entries-in-buffer/entries-written: 1/1 #P:4 > > > # > > > # _-----=> irqs-off > > > # / _----=> need-resched > > > # | / _---=> hardirq/softirq > > > # || / _--=> preempt-depth > > > # ||| / delay > > > # TASK-PID CPU# |||| TIMESTAMP FUNCTION > > > # | | | |||| | | > > > xfs_io-2158 [001] ...1 309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560 > > > # > > > > > > And the timestamp matched the time that generic/017 was running. > > > > SO I've run this on my typical 16-way fsmark workload with different > > size logs. It barely triggers on log sizes larger than 64MB, on 32MB > > logs I can see it capturing all 16 fsmark processes while waiting > > for the CIL context to switch. 
This will give you an idea of the > > log cycles the capture is occuring on, and the count of processes > > being captured: > > > > $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c > > 16 251 [snip] > > 16 2892 > > $ > > Thanks. I assume I'm looking at cycle numbers and event counts here? Yes. > > So typically groups of captures are hundreds of log cycles apart > > (100 cycles x 32MB = ~3GB of log writes), then there will be a > > stutter where the CIL dispatch is delayed, and then everything > > continues on. These all show the log is always around the 75% full > > (AIL tail pushing theshold) but the reservation grant wait lists are > > always empty so we're not running out of reservation space here. > > > > It's somewhat interesting that we manage to block every thread most of > the time before the CIL push task starts. I wonder a bit if that pattern > would hold for a system/workload with more CPUs (and if so, if there are > any odd side effects of stalling and waking hundreds of tasks at the > same time vs. our traditional queuing behavior). If I increase the concurrency (e.g. 16->32 threads for fsmark on a 64MB log), we hammer the spinlock on the grant head -hard-. i.e. CPU usage goes up by 40%, performance goes down by 50%, and all that CPU time is spent spinning on the reserve grant head lock. Basically, the log reservation space runs out, and we end up queuing on the reservation grant head and then we get reminded of just how bad having a serialisation point in the reservation fast path actually is for scalability... + 41.05% 37.93% [kernel] [k] __pv_queued_spin_lock_slowpath .... 
- 30.37% 0.02% [kernel] [k] xlog_grant_head_check - 17.98% xlog_grant_head_check - 14.17% _raw_spin_lock + 14.47% __pv_queued_spin_lock_slowpath + 8.39% xlog_grant_head_wait + 1.17% xlog_grant_head_wake + 26.47% 0.10% [kernel] [k] __xfs_trans_commit - 26.06% 1.15% [kernel] [k] xfs_log_commit_cil - 15.69% xfs_log_commit_cil - 9.90% xfs_log_done + 9.83% xfs_log_space_wake + 2.28% xfs_buf_item_format + 0.57% _raw_spin_lock - 16.50% 0.01% [kernel] [k] xfs_log_done - 9.90% xfs_log_done - 9.84% xfs_log_space_wake + 13.63% _raw_spin_lock + 1.39% xlog_grant_head_wake - 16.22% 0.02% [kernel] [k] xfs_log_space_wake - 9.83% xfs_log_space_wake + 13.64% _raw_spin_lock + 1.40% xlog_grant_head_wake So, essentially, increasing the concurrency on a small log quickly hits the log reservation limits (because tail pushing!) and that largely prevents the CIL from overrunning, too. Compared to the 16-way/64MB log output I snipped above, here's the 32-way CIL blocking: $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c 30 994 28 1610 4 1611 $ Much less, but it still tends to block most threads when it occurs. I think we can largely ignore the potential thundering herd here because it seems quite rare compared to the amount of throttling being done by the grant head... Larger logs block more threads on the CIL throttle, but the 32MB CIL window can soak up hundreds of max sized transaction reservations before overflowing so even running several hundred concurrent modification threads I haven't been able to drive enough concurrency through the CIL to see any sort of adverse behaviour. And the workloads are running pretty consistently at less than 5,000 context switches/sec so there's no evidence of repeated thundering herd wakeup problems, either. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
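The "lockless fast path falls back to spinlock serialisation when space runs out" behaviour discussed in this exchange can be modelled with a C11 compare-exchange loop. This is a userspace illustration of the general technique only, not the actual xlog grant head code (the real grant head packs a cycle and byte count into one 64-bit value and the slow path queues on the grant head under a spinlock):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <assert.h>

/* Grant head modelled as bytes reserved so far. */
static atomic_long grant_head = 0;

/* Fast path: reserve 'need' bytes against 'space' total with a cmpxchg
 * loop, with no locking at all. Returning false is where the real code
 * takes the spinlock and sleeps on the reserve queue - the serialisation
 * point that hammers __pv_queued_spin_lock_slowpath in the profile above. */
bool grant_head_try_reserve(long need, long space)
{
	long old = atomic_load(&grant_head);

	do {
		if (old + need > space)
			return false;	/* out of log space: slow path */
	} while (!atomic_compare_exchange_weak(&grant_head, &old,
					       old + need));
	return true;
}

/* Freeing space (tail moving forward) is a plain atomic subtract. */
void grant_head_release(long bytes)
{
	atomic_fetch_sub(&grant_head, bytes);
}
```

The key property is that contention cost is near zero until reservations start failing; once the log is full, every waiter funnels through the one lock, which matches the observed cliff when concurrency exhausts reservation space.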
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-01 23:14 ` Dave Chinner @ 2019-10-02 12:41 ` Brian Foster 2019-10-03 1:25 ` Dave Chinner 0 siblings, 1 reply; 23+ messages in thread From: Brian Foster @ 2019-10-02 12:41 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Wed, Oct 02, 2019 at 09:14:33AM +1000, Dave Chinner wrote: > On Tue, Oct 01, 2019 at 09:13:36AM -0400, Brian Foster wrote: > > On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote: > > > On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote: > > > > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote: > > > > > Have you done similar testing for small/minimum sized logs? > > > > > > > > Yes. I've had the tracepoint active during xfstests runs on test > > > > filesystems using default log sizes on 5-15GB filesystems. The only > > > > test in all of xfstests that has triggered it is generic/017, and it > > > > only triggered once. > > > > > > > > e.g. > > > > > > > > # trace-cmd start -e xfs_log_cil_wait > > > > <run xfstests> > > > > # trace-cmd show > > > > # tracer: nop > > > > # > > > > # entries-in-buffer/entries-written: 1/1 #P:4 > > > > # > > > > # _-----=> irqs-off > > > > # / _----=> need-resched > > > > # | / _---=> hardirq/softirq > > > > # || / _--=> preempt-depth > > > > # ||| / delay > > > > # TASK-PID CPU# |||| TIMESTAMP FUNCTION > > > > # | | | |||| | | > > > > xfs_io-2158 [001] ...1 309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560 > > > > # > > > > > > > > And the timestamp matched the time that generic/017 was running. > > > > > > SO I've run this on my typical 16-way fsmark workload with different > > > size logs. 
It barely triggers on log sizes larger than 64MB, on 32MB > > > logs I can see it capturing all 16 fsmark processes while waiting > > > for the CIL context to switch. This will give you an idea of the > > > log cycles the capture is occuring on, and the count of processes > > > being captured: > > > > > > $ sudo trace-cmd show | awk -e '/^ / {print $23}' | sort -n |uniq -c > > > 16 251 > [snip] > > > 16 2892 > > > $ > > > > Thanks. I assume I'm looking at cycle numbers and event counts here? > > Yes. > > > > So typically groups of captures are hundreds of log cycles apart > > > (100 cycles x 32MB = ~3GB of log writes), then there will be a > > > stutter where the CIL dispatch is delayed, and then everything > > > continues on. These all show the log is always around the 75% full > > > (AIL tail pushing theshold) but the reservation grant wait lists are > > > always empty so we're not running out of reservation space here. > > > > > > > It's somewhat interesting that we manage to block every thread most of > > the time before the CIL push task starts. I wonder a bit if that pattern > > would hold for a system/workload with more CPUs (and if so, if there are > > any odd side effects of stalling and waking hundreds of tasks at the > > same time vs. our traditional queuing behavior). > > If I increase the concurrency (e.g. 16->32 threads for fsmark on a > 64MB log), we hammer the spinlock on the grant head -hard-. i.e. CPU > usage goes up by 40%, performance goes down by 50%, and all that CPU > time is spent spinning on the reserve grant head lock. Basically, > the log reservation space runs out, and we end up queuing on the > reservation grant head and then we get reminded of just how bad > having a serialisation point in the reservation fast path actually > is for scalability... > The small log case is not really what I'm wondering about. Does this behavior translate to a similar test with a maximum sized log? ... 
> > Larger logs block more threads on the CIL throttle, but the 32MB CIL > window can soak up hundreds of max sized transaction reservations > before overflowing so even running several hundred concurrent > modification threads I haven't been able to drive enough concurrency > through the CIL to see any sort of adverse behaviour. And the > workloads are running pretty consistently at less than 5,000 context > switches/sec so there's no evidence of repeated thundering heard > wakeup problems, either. > That speaks to the rarity of the throttle, which is good. But I'm wondering, for example, what might happen on systems where we could have hundreds of physical CPUs committing to the CIL, we block them all on the throttle and then wake them all at once. IOW, can we potentially create the contention conditions you reproduce above in scenarios where they might not have existed before? Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-02 12:41 ` Brian Foster @ 2019-10-03 1:25 ` Dave Chinner 2019-10-03 14:41 ` Brian Foster 0 siblings, 1 reply; 23+ messages in thread From: Dave Chinner @ 2019-10-03 1:25 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Wed, Oct 02, 2019 at 08:41:39AM -0400, Brian Foster wrote: > On Wed, Oct 02, 2019 at 09:14:33AM +1000, Dave Chinner wrote: > > On Tue, Oct 01, 2019 at 09:13:36AM -0400, Brian Foster wrote: > > > On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote: > > > > So typically groups of captures are hundreds of log cycles apart > > > > (100 cycles x 32MB = ~3GB of log writes), then there will be a > > > > stutter where the CIL dispatch is delayed, and then everything > > > > continues on. These all show the log is always around the 75% full > > > > (AIL tail pushing theshold) but the reservation grant wait lists are > > > > always empty so we're not running out of reservation space here. > > > > > > > > > > It's somewhat interesting that we manage to block every thread most of > > > the time before the CIL push task starts. I wonder a bit if that pattern > > > would hold for a system/workload with more CPUs (and if so, if there are > > > any odd side effects of stalling and waking hundreds of tasks at the > > > same time vs. our traditional queuing behavior). > > > > If I increase the concurrency (e.g. 16->32 threads for fsmark on a > > 64MB log), we hammer the spinlock on the grant head -hard-. i.e. CPU > > usage goes up by 40%, performance goes down by 50%, and all that CPU > > time is spent spinning on the reserve grant head lock. Basically, > > the log reservation space runs out, and we end up queuing on the > > reservation grant head and then we get reminded of just how bad > > having a serialisation point in the reservation fast path actually > > is for scalability... > > > > The small log case is not really what I'm wondering about. 
Does this > behavior translate to a similar test with a maximum sized log? Nope, the transactions all hit the CIL throttle within a couple of hundred microseconds of each other, then the CIL push schedules, and then a couple of hundred microseconds later they are unblocked because the CIL push has started. > ... > > > > Larger logs block more threads on the CIL throttle, but the 32MB CIL > > window can soak up hundreds of max sized transaction reservations > > before overflowing so even running several hundred concurrent > > modification threads I haven't been able to drive enough concurrency > > through the CIL to see any sort of adverse behaviour. And the > > workloads are running pretty consistently at less than 5,000 context > > switches/sec so there's no evidence of repeated thundering heard > > wakeup problems, either. > > > > That speaks to the rarity of the throttle, which is good. But I'm > wondering, for example, what might happen on systems where we could have > hundreds of physical CPUs committing to the CIL, we block them all on > the throttle and then wake them all at once. IOW, can we potentially > create the contention conditions you reproduce above in scenarios where > they might not have existed before? I don't think it will create any new contention points - the contention I described above can be triggered without the CIL throttle in place, too. It just requires enough concurrent transactions to exhaust the entire log reservation, and then we go from a lockless grant head reservation algorithm to a spinlock serialised waiting algorithm. i.e. the contention starts when we have enough concurrency to fall off the lockless fast path. So with a 2GB log and fast storage, we likely need a sustained workload of tens of thousands of concurrent transaction reservations to exhaust log space and drive us into this situation. We generally don't have applications that have this sort of concurrency capability... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-03 1:25 ` Dave Chinner @ 2019-10-03 14:41 ` Brian Foster 2019-10-04 2:27 ` Dave Chinner 0 siblings, 1 reply; 23+ messages in thread From: Brian Foster @ 2019-10-03 14:41 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Thu, Oct 03, 2019 at 11:25:56AM +1000, Dave Chinner wrote: > On Wed, Oct 02, 2019 at 08:41:39AM -0400, Brian Foster wrote: > > On Wed, Oct 02, 2019 at 09:14:33AM +1000, Dave Chinner wrote: > > > On Tue, Oct 01, 2019 at 09:13:36AM -0400, Brian Foster wrote: > > > > On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote: > > > > > So typically groups of captures are hundreds of log cycles apart > > > > > (100 cycles x 32MB = ~3GB of log writes), then there will be a > > > > > stutter where the CIL dispatch is delayed, and then everything > > > > > continues on. These all show the log is always around the 75% full > > > > > (AIL tail pushing theshold) but the reservation grant wait lists are > > > > > always empty so we're not running out of reservation space here. > > > > > > > > > > > > > It's somewhat interesting that we manage to block every thread most of > > > > the time before the CIL push task starts. I wonder a bit if that pattern > > > > would hold for a system/workload with more CPUs (and if so, if there are > > > > any odd side effects of stalling and waking hundreds of tasks at the > > > > same time vs. our traditional queuing behavior). > > > > > > If I increase the concurrency (e.g. 16->32 threads for fsmark on a > > > 64MB log), we hammer the spinlock on the grant head -hard-. i.e. CPU > > > usage goes up by 40%, performance goes down by 50%, and all that CPU > > > time is spent spinning on the reserve grant head lock. 
Basically, > > > the log reservation space runs out, and we end up queuing on the > > > reservation grant head and then we get reminded of just how bad > > > having a serialisation point in the reservation fast path actually > > > is for scalability... > > > > > > > The small log case is not really what I'm wondering about. Does this > > behavior translate to a similar test with a maximum sized log? > > Nope, the transactions all hit the CIL throttle within a couple of > hundred microseconds of each other, then the CIL push schedules, and > then a couple of hundred microseconds later they are unblocked > because the CIL push has started. > > > ... > > > > > > Larger logs block more threads on the CIL throttle, but the 32MB CIL > > > window can soak up hundreds of max sized transaction reservations > > > before overflowing so even running several hundred concurrent > > > modification threads I haven't been able to drive enough concurrency > > > through the CIL to see any sort of adverse behaviour. And the > > > workloads are running pretty consistently at less than 5,000 context > > > switches/sec so there's no evidence of repeated thundering heard > > > wakeup problems, either. > > > > > > > That speaks to the rarity of the throttle, which is good. But I'm > > wondering, for example, what might happen on systems where we could have > > hundreds of physical CPUs committing to the CIL, we block them all on > > the throttle and then wake them all at once. IOW, can we potentially > > create the contention conditions you reproduce above in scenarios where > > they might not have existed before? > > I don't think it will create any new contention points - the > contention I described above can be triggered without the CIL > throttle in place, too. It just requires enough concurrent > transactions to exhaust the entire log reservation, and then we go > from a lockless grant head reservation algorithm to a spinlock > serialised waiting algorithm. i.e. 
the contention starts when we > have enough concurrency to fall off the lockless fast path. > > So with a 2GB log and fast storage, we likely need a sustained > workload of tens of thousands of concurrent transaction reservations > to exhaust log space and drive us into this situation. We generally > don't have applications that have this sort of concurrency > capability... > That there are some systems/configurations out there that are fast enough to avoid this condition doesn't really answer the question. If you assume something like a 1TB fs and 500MB log, with 1/4 the log consumed in the AIL and 64MB in the CIL (such that transaction commits start to block), the remaining log reservation can easily be consumed by something on the order of 100 open transactions. Hmm, I'm also not sure the lockless reservation algorithm is totally immune to increased concurrency in this regard. What prevents multiple tasks from racing through xlog_grant_head_check() and blowing past the log head, for example? I gave this a quick test out of curiosity and with a 15GB fs with a 10MB log, I should only be able to send 5 or so truncate transactions through xfs_log_reserve() before blocking. With a couple injected delays, I'm easily able to send 32 into the grant space modification code and that eventually results in something like this: truncate-1233 [002] ...1 1520.396545: xfs_log_reserve_exit: dev 253:4 t_ocnt 8 t_cnt 8 t_curr_res 266260 t_unit_res 266260 t_flags XLOG_TIC_INITED|XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 7 grant_reserve_bytes 5306880 grant_write_cycle 7 grant_write_bytes 5306880 curr_cycle 1 curr_block 115 tail_cycle 1 tail_block 115 ... where the grant heads have not only blown the tail, but cycled around the log multiple times. :/ Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
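[Editorial note: the race Brian demonstrates above comes from the free-space check and the grant head update being separate, unserialised steps. The model below sketches that interleaving deterministically; all names are illustrative, and this is not the actual kernel code in xlog_grant_head_check().]

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model (not kernel code) of the lockless reservation
 * race: each task checks free space and then adds to the grant head
 * as two separate steps.  If several tasks pass the check before any
 * of them updates the head, their combined grant can exceed the free
 * space between the log head and tail.
 */
struct grant_model {
	int64_t free_space;	/* space between log head and tail */
	int64_t grant_head;	/* space already reserved */
};

/* The unlocked fast-path check: a racy read, no serialisation. */
static bool grant_check(const struct grant_model *log, int64_t need)
{
	return log->grant_head + need <= log->free_space;
}

/* The grant head update, performed after the check has passed. */
static void grant_take(struct grant_model *log, int64_t need)
{
	log->grant_head += need;
}
```

Interleave two 6MB reservations against 10MB of free space: both checks pass before either update runs, both updates then happen, and the grant head ends up 2MB past what the log can hold - the transient overrun the debug check reports.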
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-03 14:41 ` Brian Foster @ 2019-10-04 2:27 ` Dave Chinner 2019-10-04 11:50 ` Brian Foster 0 siblings, 1 reply; 23+ messages in thread From: Dave Chinner @ 2019-10-04 2:27 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Thu, Oct 03, 2019 at 10:41:14AM -0400, Brian Foster wrote: > On Thu, Oct 03, 2019 at 11:25:56AM +1000, Dave Chinner wrote: > > On Wed, Oct 02, 2019 at 08:41:39AM -0400, Brian Foster wrote: > > > On Wed, Oct 02, 2019 at 09:14:33AM +1000, Dave Chinner wrote: > > > > On Tue, Oct 01, 2019 at 09:13:36AM -0400, Brian Foster wrote: > > > > > On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote: > > > > > > So typically groups of captures are hundreds of log cycles apart > > > > > > (100 cycles x 32MB = ~3GB of log writes), then there will be a > > > > > > stutter where the CIL dispatch is delayed, and then everything > > > > > > continues on. These all show the log is always around the 75% full > > > > > > (AIL tail pushing theshold) but the reservation grant wait lists are > > > > > > always empty so we're not running out of reservation space here. > > > > > > > > > > > > > > > > It's somewhat interesting that we manage to block every thread most of > > > > > the time before the CIL push task starts. I wonder a bit if that pattern > > > > > would hold for a system/workload with more CPUs (and if so, if there are > > > > > any odd side effects of stalling and waking hundreds of tasks at the > > > > > same time vs. our traditional queuing behavior). > > > > > > > > If I increase the concurrency (e.g. 16->32 threads for fsmark on a > > > > 64MB log), we hammer the spinlock on the grant head -hard-. i.e. CPU > > > > usage goes up by 40%, performance goes down by 50%, and all that CPU > > > > time is spent spinning on the reserve grant head lock. 
Basically, > > > > the log reservation space runs out, and we end up queuing on the > > > > reservation grant head and then we get reminded of just how bad > > > > having a serialisation point in the reservation fast path actually > > > > is for scalability... > > > > > > > > > > The small log case is not really what I'm wondering about. Does this > > > behavior translate to a similar test with a maximum sized log? > > > > Nope, the transactions all hit the CIL throttle within a couple of > > hundred microseconds of each other, then the CIL push schedules, and > > then a couple of hundred microseconds later they are unblocked > > because the CIL push has started. > > > > > ... > > > > > > > > Larger logs block more threads on the CIL throttle, but the 32MB CIL > > > > window can soak up hundreds of max sized transaction reservations > > > > before overflowing so even running several hundred concurrent > > > > modification threads I haven't been able to drive enough concurrency > > > > through the CIL to see any sort of adverse behaviour. And the > > > > workloads are running pretty consistently at less than 5,000 context > > > > switches/sec so there's no evidence of repeated thundering heard > > > > wakeup problems, either. > > > > > > > > > > That speaks to the rarity of the throttle, which is good. But I'm > > > wondering, for example, what might happen on systems where we could have > > > hundreds of physical CPUs committing to the CIL, we block them all on > > > the throttle and then wake them all at once. IOW, can we potentially > > > create the contention conditions you reproduce above in scenarios where > > > they might not have existed before? > > > > I don't think it will create any new contention points - the > > contention I described above can be triggered without the CIL > > throttle in place, too. 
It just requires enough concurrent > > transactions to exhaust the entire log reservation, and then we go > > from a lockless grant head reservation algorithm to a spinlock > > serialised waiting algorithm. i.e. the contention starts when we > > have enough concurrency to fall off the lockless fast path. > > > > So with a 2GB log and fast storage, we likely need a sustained > > workload of tens of thousands of concurrent transaction reservations > > to exhaust log space and drive us into this situation. We generally > > don't have applications that have this sort of concurrency > > capability... > > > > That there are some systems/configurations out there that are fast > enough to avoid this condition doesn't really answer the question. If > you assume something like a 1TB fs and 500MB log, with 1/4 the log > consumed in the AIL and 64MB in the CIL (such that transaction commits > start to block), the remaining log reservation can easily be consumed by > something on the order of 100 open transactions. If individual reservations are in the order of 3MB on a 4kB filesystem, then we've screwed reservations up really badly.
/me goes looking

Reservation sizes for 1TB AGs and reflink enabled show:

xfs_trans_resv_calc: dev 253:32 type 0 logres 201976 logcount 8 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 1 logres 361344 logcount 8 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 2 logres 307936 logcount 2 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 3 logres 187760 logcount 2 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 4 logres 197760 logcount 2 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 5 logres 303616 logcount 3 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 6 logres 302464 logcount 2 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 7 logres 319488 logcount 2 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 8 logres 302464 logcount 3 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 9 logres 337784 logcount 2 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 10 logres 2168 logcount 0 flags 0x0
xfs_trans_resv_calc: dev 253:32 type 11 logres 90624 logcount 2 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 12 logres 116856 logcount 2 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 13 logres 760 logcount 0 flags 0x0
xfs_trans_resv_calc: dev 253:32 type 14 logres 360576 logcount 1 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 15 logres 23288 logcount 3 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 16 logres 13312 logcount 0 flags 0x0
xfs_trans_resv_calc: dev 253:32 type 17 logres 181376 logcount 3 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 18 logres 640 logcount 0 flags 0x0
xfs_trans_resv_calc: dev 253:32 type 19 logres 111864 logcount 2 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 20 logres 4224 logcount 0 flags 0x0
xfs_trans_resv_calc: dev 253:32 type 21 logres 6512 logcount 0 flags 0x0
xfs_trans_resv_calc: dev 253:32 type 22 logres 232 logcount 1 flags 0x0
xfs_trans_resv_calc: dev 253:32 type 23 logres 206199 logcount 8 flags 0x4
xfs_trans_resv_calc: dev 253:32 type 24 logres 976 logcount 1 flags 0x0
xfs_trans_resv_calc: dev 253:32 type 25 logres 336 logcount 1 flags 0x0
xfs_trans_resv_calc: dev 253:32 type 26 logres 640 logcount 1 flags 0x0
xfs_trans_resv_calc: dev 253:32 type 27 logres 760 logcount 0 flags 0x0
xfs_trans_resv_calc: dev 253:32 type -1 logres 361344 logcount 8 flags 0x4

Oh, wow, those are actually massive. A write reservation should not need a 200kB reservation - it used to be in the order of 100kB and with a log count of 3. And that truncate reservation @ 360kB is way more than it needs to be. I'm going to have a deeper look into this, because I strongly suspect we've screwed things up here with respect to rmap/reflink and deferred ops. > Hmm, I'm also not sure the lockless reservation algorithm is totally > immune to increased concurrency in this regard. What prevents multiple > tasks from racing through xlog_grant_head_check() and blowing past the > log head, for example? Nothing. Debug kernels even emit "xlog_verify_grant_tail: space > BBTOB(tail_blocks)" messages when that happens. It's pretty difficult to do this in real world conditions, even when there is lots of concurrency being used. But here's the rub: it's not actually the end of the world because the reservation doesn't actually determine how much of the log is currently being used by running transactions - the reservation is for a maximal rolling iteration of a permanent transaction, not the initial transaction that will be running. Hence if we overrun occasionally we don't immediately run out of log space and corrupt the log. Yes, if none of the rolling transactions complete and they all need to use their entire reservation, and the tail of the log cannot be moved forward because it is pinned by one of the transactions that is running, then we'll likely get a log hang on a regrant on the write head.
But if any of the transactions don't use all of their reservation, then the overrun gets soaked up by the unused parts of the transactions that are completed and returned to the reservation head, and nobody even notices that there was a temporary overrun of the grant head space. Hence occasional overruns on the reservation head before they start blocking isn't really a problem in practice because the probability that the entire reservation of every running transaction is required to make forwards progress is extremely small. Basically, we gave up "perfect reservation space grant accounting" because performance was extremely important and the risk of log hangs as a result of overruns was considered to be extremely low and worth taking for the benefits the algorithm provided. This was just a simple, pragmatic risk based engineering decision. > I gave this a quick test out of curiosity and with a 15GB fs with a 10MB > log, I should only be able to send 5 or so truncate transactions through > xfs_log_reserve() before blocking. With a couple injected delays, I'm > easily able to send 32 into the grant space modification code and that > eventually results in something like this: Yup, we can break lots of things by injecting artificial delays, but that doesn't mean there is a practical real world problem with the algorithm we need to fix.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
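[Editorial note: the "order of 3MB" figure in this exchange can be checked against the xfs_trans_resv_calc trace output quoted above: the worst-case log space a permanent transaction can consume is its unit reservation times its log count (one unit per expected roll). The sketch below uses the truncate numbers from that trace; the helper name is made up for illustration.]

```c
#include <stdint.h>

/*
 * Worst-case log space pinned by a permanent transaction: one unit
 * reservation per expected roll of the transaction.  The numbers
 * used with this come from the xfs_trans_resv_calc trace output
 * quoted earlier in the thread; the helper is illustrative only.
 */
static uint64_t worst_case_resv(uint64_t logres, uint64_t logcount)
{
	return logres * logcount;
}
```

The truncate reservation (logres 361344, logcount 8) works out to ~2.8MB, so ~100 such open transactions account for roughly 275MB of grant space - consistent with Brian's 500MB-log scenario above.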
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-04 2:27 ` Dave Chinner @ 2019-10-04 11:50 ` Brian Foster 2019-10-08 2:51 ` Dave Chinner 0 siblings, 1 reply; 23+ messages in thread From: Brian Foster @ 2019-10-04 11:50 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Fri, Oct 04, 2019 at 12:27:55PM +1000, Dave Chinner wrote: > On Thu, Oct 03, 2019 at 10:41:14AM -0400, Brian Foster wrote: > > On Thu, Oct 03, 2019 at 11:25:56AM +1000, Dave Chinner wrote: > > > On Wed, Oct 02, 2019 at 08:41:39AM -0400, Brian Foster wrote: > > > > On Wed, Oct 02, 2019 at 09:14:33AM +1000, Dave Chinner wrote: > > > > > On Tue, Oct 01, 2019 at 09:13:36AM -0400, Brian Foster wrote: > > > > > > On Tue, Oct 01, 2019 at 01:42:07PM +1000, Dave Chinner wrote: > > > > > > > So typically groups of captures are hundreds of log cycles apart > > > > > > > (100 cycles x 32MB = ~3GB of log writes), then there will be a > > > > > > > stutter where the CIL dispatch is delayed, and then everything > > > > > > > continues on. These all show the log is always around the 75% full > > > > > > > (AIL tail pushing theshold) but the reservation grant wait lists are > > > > > > > always empty so we're not running out of reservation space here. > > > > > > > > > > > > > > > > > > > It's somewhat interesting that we manage to block every thread most of > > > > > > the time before the CIL push task starts. I wonder a bit if that pattern > > > > > > would hold for a system/workload with more CPUs (and if so, if there are > > > > > > any odd side effects of stalling and waking hundreds of tasks at the > > > > > > same time vs. our traditional queuing behavior). > > > > > > > > > > If I increase the concurrency (e.g. 16->32 threads for fsmark on a > > > > > 64MB log), we hammer the spinlock on the grant head -hard-. i.e. CPU > > > > > usage goes up by 40%, performance goes down by 50%, and all that CPU > > > > > time is spent spinning on the reserve grant head lock. 
Basically, > > > > > the log reservation space runs out, and we end up queuing on the > > > > > reservation grant head and then we get reminded of just how bad > > > > > having a serialisation point in the reservation fast path actually > > > > > is for scalability... > > > > > > > > > > > > > The small log case is not really what I'm wondering about. Does this > > > > behavior translate to a similar test with a maximum sized log? > > > > > > Nope, the transactions all hit the CIL throttle within a couple of > > > hundred microseconds of each other, then the CIL push schedules, and > > > then a couple of hundred microseconds later they are unblocked > > > because the CIL push has started. > > > > > > > ... > > > > > > > > > > Larger logs block more threads on the CIL throttle, but the 32MB CIL > > > > > window can soak up hundreds of max sized transaction reservations > > > > > before overflowing so even running several hundred concurrent > > > > > modification threads I haven't been able to drive enough concurrency > > > > > through the CIL to see any sort of adverse behaviour. And the > > > > > workloads are running pretty consistently at less than 5,000 context > > > > > switches/sec so there's no evidence of repeated thundering heard > > > > > wakeup problems, either. > > > > > > > > > > > > > That speaks to the rarity of the throttle, which is good. But I'm > > > > wondering, for example, what might happen on systems where we could have > > > > hundreds of physical CPUs committing to the CIL, we block them all on > > > > the throttle and then wake them all at once. IOW, can we potentially > > > > create the contention conditions you reproduce above in scenarios where > > > > they might not have existed before? > > > > > > I don't think it will create any new contention points - the > > > contention I described above can be triggered without the CIL > > > throttle in place, too. 
It just requires enough concurrent > > > transactions to exhaust the entire log reservation, and then we go > > > from a lockless grant head reservation algorithm to a spinlock > > > serialised waiting algorithm. i.e. the contention starts when we > > > have enough concurrency to fall off the lockless fast path. > > > > > > So with a 2GB log and fast storage, we likely need a sustained > > > workload of tens of thousands of concurrent transaction reservations > > > to exhaust log space and drive us into this situation. We generally > > > don't have applications that have this sort of concurrency > > > capability... > > > > > > > That there are some systems/configurations out there that are fast > > enough to avoid this condition doesn't really answer the question. If > > you assume something like a 1TB fs and 500MB log, with 1/4 the log > > consumed in the AIL and 64MB in the CIL (such that transaction commits > > start to block), the remaining log reservation can easily be consumed by > > something on the order of 100 open transactions. > > If individual reservations in in the order of 3MB on a 4kB > filesystem, then we've screwed reservations up really badly. 
> > /me goes looking > > Reservation sizes for 1TB AGs and reflink enabled show: > > xfs_trans_resv_calc: dev 253:32 type 0 logres 201976 logcount 8 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 1 logres 361344 logcount 8 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 2 logres 307936 logcount 2 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 3 logres 187760 logcount 2 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 4 logres 197760 logcount 2 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 5 logres 303616 logcount 3 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 6 logres 302464 logcount 2 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 7 logres 319488 logcount 2 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 8 logres 302464 logcount 3 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 9 logres 337784 logcount 2 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 10 logres 2168 logcount 0 flags 0x0 > xfs_trans_resv_calc: dev 253:32 type 11 logres 90624 logcount 2 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 12 logres 116856 logcount 2 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 13 logres 760 logcount 0 flags 0x0 > xfs_trans_resv_calc: dev 253:32 type 14 logres 360576 logcount 1 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 15 logres 23288 logcount 3 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 16 logres 13312 logcount 0 flags 0x0 > xfs_trans_resv_calc: dev 253:32 type 17 logres 181376 logcount 3 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 18 logres 640 logcount 0 flags 0x0 > xfs_trans_resv_calc: dev 253:32 type 19 logres 111864 logcount 2 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 20 logres 4224 logcount 0 flags 0x0 > xfs_trans_resv_calc: dev 253:32 type 21 logres 6512 logcount 0 flags 0x0 > xfs_trans_resv_calc: dev 253:32 type 22 logres 232 logcount 1 flags 0x0 > xfs_trans_resv_calc: dev 253:32 type 23 logres 206199 logcount 8 flags 0x4 > xfs_trans_resv_calc: dev 253:32 type 24 logres 976 logcount 1 flags 0x0 > xfs_trans_resv_calc: 
dev 253:32 type 25 logres 336 logcount 1 flags 0x0 > xfs_trans_resv_calc: dev 253:32 type 26 logres 640 logcount 1 flags 0x0 > xfs_trans_resv_calc: dev 253:32 type 27 logres 760 logcount 0 flags 0x0 > xfs_trans_resv_calc: dev 253:32 type -1 logres 361344 logcount 8 flags 0x4 > > Oh, wow, those are actually massive. A write reservation should > not need a 200kB reservation - it used to be in the order of 100kB > and with a log count of 3. And that truncate reservation @ 360kB is > way more than it needs to be. > Yeah, reflink extends some of the log counts quite a bit. > I'm going to have a deeper look into this, because I strongly > suspect we've screwed things up here with respect to rmap/reflink > and deferred ops. > Ok. > > Hmm, I'm also not sure the lockless reservation algorithm is totally > > immune to increased concurrency in this regard. What prevents multiple > > tasks from racing through xlog_grant_head_check() and blowing past the > > log head, for example? > > Nothing. Debug kernels even emit a "xlog_verify_grant_tail: space > > BBTOB(tail_blocks)" messages when that happens. It's pretty > difficult to do this in real world conditions, even when there is > lots of concurrency being used. > Hm, Ok. Though I've seen that alert enough times that I (unintentionally) ignore it at this point, so it can't be that hard to reproduce. ;) That is usually during fstests however, and not a typical workload that I recall. Of course, there's a difference between reproducing the basic condition and taking it to the point where it manifests into a problem. > But here's the rub: it's not actually the end of the world because > the reservation doesn't actually determine how much of the log is > currently being used by running transactions - the reservation is > for a maximal rolling iteration of a permanent transaction, not the > initial transaction will be running. Hence if we overrun > occassionally we don't immediately run out of log space and corrupt > the log. 
> Ok, that much is evident from the amount of time this mechanism has been in place without any notable issues. > Yes, if none of the rolling transactions complete and they all need > to use their entire reservation, and the tail of the log cannot be > moved forward because it is pinned by one of the transactions that > is running, then we'll likely get a log hang on a regrant on the > write head. But if any of the transactions don't use all of their > reservation, then the overrun gets soaked up by the unused parts of > the transactions that are completed and returned to reservation > head, and nobody even notices taht there was a temporary overrun of > the grant head space. > Ok, I didn't expect this to be some catastrophic problem or really a problem with your patch simply based on the lifetime of the code and how the grant heads are actually used. I was going to suggest an assert or something to detect whether batching behavior as a side effect of the commit throttle would ever increase likelihood of this situation, but it looks like the grant verify function somewhat serves that purpose already. I'd _prefer_ to see something, at least in DEBUG mode, that indicates the frequency of the fundamental incorrect accounting condition as opposed to just the side effect of blowing the tail (because the latter depends on other difficult to reproduce factors), but I'd have to think about that some more as it would need to balance against normal/expected execution flow. Thanks for the background. > Hence occasional overruns on the reservation head before they start > blocking isn't really a problem in practice because the probability > of all the transaction reservation of all transactions running being > required to make forwards progress is extremely small. 
> > Basically, we gave up "perfect reservation space grant accounting" > because performance was extremely important and risk of log hangs as > a result of overruns was considered to be extremely low and worth > taking for the benefits the algorithm provided. This was just a > simple, pragmatic risk based engineering decision. > FWIW, the comment for xlog_verify_tail() also suggests the potential for false positives and references a panic tag, which all seems kind of erratic and misleading compared to what you explain here. Unless I've missed it somewhere, it would be nice if this intention were documented somewhere more clearly. Perhaps I'll look into rewriting that comment too.. > > I gave this a quick test out of curiosity and with a 15GB fs with a 10MB > > log, I should only be able to send 5 or so truncate transactions through > > xfs_log_reserve() before blocking. With a couple injected delays, I'm > > easily able to send 32 into the grant space modification code and that > > eventually results in something like this: > > Yup, we can break lots of things by injecting artificial delays, but > that doesn't mean there is a practical real world problem with the > algorithm we need to fix.... > Delay (or error) injection is just a debugging technique to control and observe execution flow. It doesn't necessarily speak to whether something is practically reproducible or a bug or not, merely whether it's possible, which is precisely why I try to point out behavior that is manufactured vs reproduced via standard operations. Any reasoning beyond that depends on context. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
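[Editorial note: the argument discussed above - that a transient grant head overrun is normally absorbed by the unused reservation handed back when transactions complete - can be sketched as a simple accounting model. All names and numbers below are illustrative, not kernel code.]

```c
#include <stdint.h>

/*
 * Model of how a transient grant head overrun gets soaked up: a
 * permanent transaction reserves for its worst case, but a
 * completing transaction returns whatever it didn't use to the
 * reservation head.  As long as some transactions complete under
 * their full reservation, the overrun disappears before the log
 * head can reach the tail.  Illustrative only.
 */
struct grant_space {
	int64_t free_space;	/* space between log head and tail */
	int64_t grant_head;	/* space currently reserved */
};

static void reserve(struct grant_space *log, int64_t resv)
{
	log->grant_head += resv;	/* may transiently overrun */
}

static void commit(struct grant_space *log, int64_t resv, int64_t used)
{
	log->grant_head -= resv - used;	/* give back the unused part */
}
```

Two 6MB reservations against 10MB of free space overrun by 2MB; one transaction then commits having used only 1MB of its 6MB reservation, and the 5MB it returns absorbs the overrun.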
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-04 11:50 ` Brian Foster @ 2019-10-08 2:51 ` Dave Chinner 2019-10-08 13:22 ` Brian Foster 0 siblings, 1 reply; 23+ messages in thread From: Dave Chinner @ 2019-10-08 2:51 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Fri, Oct 04, 2019 at 07:50:01AM -0400, Brian Foster wrote: > On Fri, Oct 04, 2019 at 12:27:55PM +1000, Dave Chinner wrote: > > On Thu, Oct 03, 2019 at 10:41:14AM -0400, Brian Foster wrote: > > > Hmm, I'm also not sure the lockless reservation algorithm is totally > > > immune to increased concurrency in this regard. What prevents multiple > > > tasks from racing through xlog_grant_head_check() and blowing past the > > > log head, for example? > > > > Nothing. Debug kernels even emit a "xlog_verify_grant_tail: space > > > BBTOB(tail_blocks)" messages when that happens. It's pretty > > difficult to do this in real world conditions, even when there is > > lots of concurrency being used. > > > > Hm, Ok. Though I've seen that alert enough times that I > (unintentionally) ignore it at this point, so it can't be that hard to > reproduce. ;) That is usually during fstests however, and not a typical > workload that I recall. I can't say I've seen it for a long time now - I want to say "years" but I may well have simply missed it on the rare occasion it has occurred and fstests hasn't captured it. i.e. fstests is supposed to capture unusual things like this appearing in dmesg during a test.... > Of course, there's a difference between > reproducing the basic condition and taking it to the point where it > manifests into a problem. *nod* > > But here's the rub: it's not actually the end of the world because > > the reservation doesn't actually determine how much of the log is > > currently being used by running transactions - the reservation is > > for a maximal rolling iteration of a permanent transaction, not the > > initial transaction will be running. 
Hence if we overrun > > occasionally we don't immediately run out of log space and corrupt > > the log. > > > > Ok, that much is evident from the amount of time this mechanism has been > in place without any notable issues. > > > Yes, if none of the rolling transactions complete and they all need > > to use their entire reservation, and the tail of the log cannot be > > moved forward because it is pinned by one of the transactions that > > is running, then we'll likely get a log hang on a regrant on the > > write head. But if any of the transactions don't use all of their > > reservation, then the overrun gets soaked up by the unused parts of > > the transactions that are completed and returned to reservation > > head, and nobody even notices that there was a temporary overrun of > > the grant head space. > > > > Ok, I didn't expect this to be some catastrophic problem or really a > problem with your patch simply based on the lifetime of the code and how > the grant heads are actually used. I was going to suggest an assert or > something to detect whether batching behavior as a side effect of the > commit throttle would ever increase likelihood of this situation, but it > looks like the grant verify function somewhat serves that purpose > already. Yeah - xlog_verify_grant_tail() will report reservation overruns, but the serious log space problems (i.e. head overwriting the tail) are detected by xlog_verify_tail_lsn() when we stamp the tail_lsn into the current iclog header. That's still done under the icloglock and the AIL lock, so the comparison of the tail with the current log head is still completely serialised.
> I'd _prefer_ to see something, at least in DEBUG mode, that indicates > the frequency of the fundamental incorrect accounting condition as > opposed to just the side effect of blowing the tail (because the latter > depends on other difficult to reproduce factors), but I'd have to think > about that some more as it would need to balance against normal/expected > execution flow. Thanks for the background. You can test that just by removing the XLOG_TAIL_WARN flag setting, then it will warn on every reservation overrun rather than just the first. > > Hence occasional overruns on the reservation head before they start > > blocking isn't really a problem in practice because the probability > > of all the transaction reservation of all transactions running being > > required to make forwards progress is extremely small. > > > > Basically, we gave up "perfect reservation space grant accounting" > > because performance was extremely important and risk of log hangs as > > a result of overruns was considered to be extremely low and worth > > taking for the benefits the algorithm provided. This was just a > > simple, pragmatic risk based engineering decision. > > > > FWIW, the comment for xlog_verify_tail() also suggests the potential for > false positives and references a panic tag, which all seems kind of > erratic and misleading compared to what you explain here. Well, it's fundamentally an unserialised check, so it can race with other reservation grants, commits that release grant space and tail lsn updates. Hence it's not a 100% reliable debug check. It also used to be run at all times, not just under XFS_CONFIG_DEBUG=y, which is why it has a panic tag associated with it. When we first deployed it, we weren't 100% sure there weren't customer workloads that would trip over this and hang the log, so we gave ourselves a way of triggering kernel dumps the instant an overrun was detected. 
Hence a site that had log hangs with this message in the logs could turn on the panic tag and we'd get a kernel dump to analyse... Since then, this code has been relegated to debug code but the panic tag still exists. It could be turned back into an ASSERT now, but it's still useful the way it is as it means debug kernels don't fall over the moment a spurious overrun occurs... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-08 2:51 ` Dave Chinner @ 2019-10-08 13:22 ` Brian Foster 2019-10-08 17:34 ` Brian Foster 0 siblings, 1 reply; 23+ messages in thread From: Brian Foster @ 2019-10-08 13:22 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Tue, Oct 08, 2019 at 01:51:57PM +1100, Dave Chinner wrote: > On Fri, Oct 04, 2019 at 07:50:01AM -0400, Brian Foster wrote: > > On Fri, Oct 04, 2019 at 12:27:55PM +1000, Dave Chinner wrote: > > > On Thu, Oct 03, 2019 at 10:41:14AM -0400, Brian Foster wrote: > > > > Hmm, I'm also not sure the lockless reservation algorithm is totally > > > > immune to increased concurrency in this regard. What prevents multiple > > > > tasks from racing through xlog_grant_head_check() and blowing past the > > > > log head, for example? > > > > > > Nothing. Debug kernels even emit a "xlog_verify_grant_tail: space > > > > BBTOB(tail_blocks)" messages when that happens. It's pretty > > > difficult to do this in real world conditions, even when there is > > > lots of concurrency being used. > > > > > > > Hm, Ok. Though I've seen that alert enough times that I > > (unintentionally) ignore it at this point, so it can't be that hard to > > reproduce. ;) That is usually during fstests however, and not a typical > > workload that I recall. > > I can't say I've seen it for a long time now - I want to say "years" > but I may well have simply missed it on the rare occasion it has > occurred and fstests hasn't captured it. i.e. fstests is supposed to > capture unusual things like this appearing in dmesg during a > test.... > > > Of course, there's a difference between > > reproducing the basic condition and taking it to the point where it > > manifests into a problem. 
> > *nod* > > > > But here's the rub: it's not actually the end of the world because > > > the reservation doesn't actually determine how much of the log is > > > currently being used by running transactions - the reservation is > > > for a maximal rolling iteration of a permanent transaction, not the > > > initial transaction will be running. Hence if we overrun > > > occassionally we don't immediately run out of log space and corrupt > > > the log. > > > > > > > Ok, that much is evident from the amount of time this mechanism has been > > in place without any notable issues. > > > > > Yes, if none of the rolling transactions complete and they all need > > > to use their entire reservation, and the tail of the log cannot be > > > moved forward because it is pinned by one of the transactions that > > > is running, then we'll likely get a log hang on a regrant on the > > > write head. But if any of the transactions don't use all of their > > > reservation, then the overrun gets soaked up by the unused parts of > > > the transactions that are completed and returned to reservation > > > head, and nobody even notices taht there was a temporary overrun of > > > the grant head space. > > > > > > > Ok, I didn't expect this to be some catastrophic problem or really a > > problem with your patch simply based on the lifetime of the code and how > > the grant heads are actually used. I was going to suggest an assert or > > something to detect whether batching behavior as a side effect of the > > commit throttle would ever increase likelihood of this situation, but it > > looks like the grant verify function somewhat serves that purpose > > already. > > Yeah - xlog_verify_grant_tail() will the report reservation > overruns, but the serious log space problems (i.e. head overwritting > the tail) are detected by xlog_verify_tail_lsn() when we stamp the > tail_lsn into the current iclog header. 
That's still done under the > icloglock and the AIL lock, so the comparison of the tail with the > current log head is still completely serialised. > > > I'd _prefer_ to see something, at least in DEBUG mode, that indicates > > the frequency of the fundamental incorrect accounting condition as > > opposed to just the side effect of blowing the tail (because the latter > > depends on other difficult to reproduce factors), but I'd have to think > > about that some more as it would need to balance against normal/expected > > execution flow. Thanks for the background. > > You can test that just by removing the XLOG_TAIL_WARN flag setting, > then it will warn on every reservation overrun rather than just the > first. > That's not quite what I'm after. That just removes the oneshot nature of the existing check. The current debug check looks for a side effect of the current algorithm in the form of overrunning the tail. What I would like to see, somehow or another, is something that provides earlier and more useful information on the frequency/scale of occurrence where multiple reservations have been made based on the same grant baseline. This is not so much an error check so not something that should be an assert or warning, but rather more of a metric that provides a base for comparison whenever we might have code changes or workloads that potentially change this behavior. For example, consider a debug mode stat or sysfs file that could be used to dump out a counter of "same value" grant head samples after a benchmark workload or fstests run. The fact that this value might go up or down is not necessarily a problem. But that would provide some debug mode data to identify and analyze potentially unintended side effects like transient spikes in concurrent grant head checks due to blocking/scheduling changes in log reservation tasks. See the appended RFC for a quick idea of what I mean. 
This is slightly racy, but I think conservative enough to provide valid values with respect to already racy reservation implementation. On my slow 4xcpu vm, I see occasional sample counts of 2 running a -p8 fsstress. If I stick a smallish delay in xfs_log_reserve(), the frequency of the output increases and I see occasional bumps to 3, a spike of 8 in one case, etc. Of course I'd probably factor the atomic calls into DEBUG only inline functions that can be compiled out on production kernels and replace the tracepoint with a global counter (exported via stats or sysfs knob), but this just illustrates the idea. The global counter could also be replaced with (or kept in addition to) something that tracks a max concurrency value if that is more useful. Any thoughts on something like this? > > > Hence occasional overruns on the reservation head before they start > > > blocking isn't really a problem in practice because the probability > > > of all the transaction reservation of all transactions running being > > > required to make forwards progress is extremely small. > > > > > > Basically, we gave up "perfect reservation space grant accounting" > > > because performance was extremely important and risk of log hangs as > > > a result of overruns was considered to be extremely low and worth > > > taking for the benefits the algorithm provided. This was just a > > > simple, pragmatic risk based engineering decision. > > > > > > > FWIW, the comment for xlog_verify_tail() also suggests the potential for > > false positives and references a panic tag, which all seems kind of > > erratic and misleading compared to what you explain here. > > Well, it's fundamentally an unserialised check, so it can race with > other reservation grants, commits that release grant space and tail > lsn updates. Hence it's not a 100% reliable debug check. > Right. Yet there is a panic knob... 
> It also used to be run at all times, not just under > XFS_CONFIG_DEBUG=y, which is why it has a panic tag associated with > it. When we first deployed it, we weren't 100% sure there weren't > customer workloads that would trip over this and hang the log, so > we gave ourselves a way of triggering kernel dumps the instant an > overrun was detected. Hence a site that had log hangs with this > message in the logs could turn on the panic tag and we'd get a > kernel dump to analyse... > Ok, makes sense. This kind of speaks to the earlier point around having more useful data. While this isn't necessarily a problem right now, we have no real data to tell us whether some particular code change alters this behavior. If this was enough of a concern when the change was first put in place to insert a panic hook, then it stands to reason it's something we should at least be cognizant of when making changes that could impact its behavior. > Since then, this code has been relegated to debug code but the panic > tag still exists. It could be turned back into a ASSERT now, but > it's still useful the way it is as it means debug kernels don't fall > over the moment a spurious overrun occurs... > Yeah, ISTM the panic bit could be removed at this point. The warning (as opposed to an assert) is probably reasonable so long as the check itself is racy so as to not cause false positive panics with fatal asserts enabled. 
Brian --- 8< --- RFC: crude concurrency stat on reserve grant head diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index a2beee9f74da..1d3056176e6e 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -175,6 +175,7 @@ xlog_grant_head_init( xlog_assign_grant_head(&head->grant, 1, 0); INIT_LIST_HEAD(&head->waiters); spin_lock_init(&head->lock); + atomic64_set(&head->sample_cnt, 0); } STATIC void @@ -446,6 +447,7 @@ xfs_log_reserve( struct xlog_ticket *tic; int need_bytes; int error = 0; + int64_t sample_cnt; ASSERT(client == XFS_TRANSACTION || client == XFS_LOG); @@ -465,13 +467,19 @@ xfs_log_reserve( error = xlog_grant_head_check(log, &log->l_reserve_head, tic, &need_bytes); + atomic64_inc(&log->l_reserve_head.sample_cnt); if (error) goto out_error; + sample_cnt = atomic64_xchg(&log->l_reserve_head.sample_cnt, 0); xlog_grant_add_space(log, &log->l_reserve_head.grant, need_bytes); xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes); trace_xfs_log_reserve_exit(log, tic); xlog_verify_grant_tail(log); + + if (sample_cnt > 1) + trace_printk("%d: %lld\n", __LINE__, sample_cnt); + return 0; out_error: diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index b880c23cb6e4..62e4949f91c4 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -339,6 +339,7 @@ struct xlog_grant_head { spinlock_t lock ____cacheline_aligned_in_smp; struct list_head waiters; atomic64_t grant; + atomic64_t sample_cnt; }; /* ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-08 13:22 ` Brian Foster @ 2019-10-08 17:34 ` Brian Foster 0 siblings, 0 replies; 23+ messages in thread From: Brian Foster @ 2019-10-08 17:34 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Tue, Oct 08, 2019 at 09:22:40AM -0400, Brian Foster wrote: > On Tue, Oct 08, 2019 at 01:51:57PM +1100, Dave Chinner wrote: > > On Fri, Oct 04, 2019 at 07:50:01AM -0400, Brian Foster wrote: > > > On Fri, Oct 04, 2019 at 12:27:55PM +1000, Dave Chinner wrote: > > > > On Thu, Oct 03, 2019 at 10:41:14AM -0400, Brian Foster wrote: > > > > > Hmm, I'm also not sure the lockless reservation algorithm is totally > > > > > immune to increased concurrency in this regard. What prevents multiple > > > > > tasks from racing through xlog_grant_head_check() and blowing past the > > > > > log head, for example? > > > > > > > > Nothing. Debug kernels even emit a "xlog_verify_grant_tail: space > > > > > BBTOB(tail_blocks)" messages when that happens. It's pretty > > > > difficult to do this in real world conditions, even when there is > > > > lots of concurrency being used. > > > > > > > > > > Hm, Ok. Though I've seen that alert enough times that I > > > (unintentionally) ignore it at this point, so it can't be that hard to > > > reproduce. ;) That is usually during fstests however, and not a typical > > > workload that I recall. > > > > I can't say I've seen it for a long time now - I want to say "years" > > but I may well have simply missed it on the rare occasion it has > > occurred and fstests hasn't captured it. i.e. fstests is supposed to > > capture unusual things like this appearing in dmesg during a > > test.... > > > > > Of course, there's a difference between > > > reproducing the basic condition and taking it to the point where it > > > manifests into a problem. 
> > > > *nod* > > > > > > But here's the rub: it's not actually the end of the world because > > > > the reservation doesn't actually determine how much of the log is > > > > currently being used by running transactions - the reservation is > > > > for a maximal rolling iteration of a permanent transaction, not the > > > > initial transaction will be running. Hence if we overrun > > > > occassionally we don't immediately run out of log space and corrupt > > > > the log. > > > > > > > > > > Ok, that much is evident from the amount of time this mechanism has been > > > in place without any notable issues. > > > > > > > Yes, if none of the rolling transactions complete and they all need > > > > to use their entire reservation, and the tail of the log cannot be > > > > moved forward because it is pinned by one of the transactions that > > > > is running, then we'll likely get a log hang on a regrant on the > > > > write head. But if any of the transactions don't use all of their > > > > reservation, then the overrun gets soaked up by the unused parts of > > > > the transactions that are completed and returned to reservation > > > > head, and nobody even notices taht there was a temporary overrun of > > > > the grant head space. > > > > > > > > > > Ok, I didn't expect this to be some catastrophic problem or really a > > > problem with your patch simply based on the lifetime of the code and how > > > the grant heads are actually used. I was going to suggest an assert or > > > something to detect whether batching behavior as a side effect of the > > > commit throttle would ever increase likelihood of this situation, but it > > > looks like the grant verify function somewhat serves that purpose > > > already. > > > > Yeah - xlog_verify_grant_tail() will the report reservation > > overruns, but the serious log space problems (i.e. head overwritting > > the tail) are detected by xlog_verify_tail_lsn() when we stamp the > > tail_lsn into the current iclog header. 
That's still done under the > > icloglock and the AIL lock, so the comparison of the tail with the > > current log head is still completely serialised. > > > > > I'd _prefer_ to see something, at least in DEBUG mode, that indicates > > > the frequency of the fundamental incorrect accounting condition as > > > opposed to just the side effect of blowing the tail (because the latter > > > depends on other difficult to reproduce factors), but I'd have to think > > > about that some more as it would need to balance against normal/expected > > > execution flow. Thanks for the background. > > > > You can test that just by removing the XLOG_TAIL_WARN flag setting, > > then it will warn on every reservation overrun rather than just the > > first. > > > > That's not quite what I'm after. That just removes the oneshot nature of > the existing check. The current debug check looks for a side effect of > the current algorithm in the form of overrunning the tail. What I would > like to see, somehow or another, is something that provides earlier and > more useful information on the frequency/scale of occurrence where > multiple reservations have been made based on the same grant baseline. > > This is not so much an error check so not something that should be an > assert or warning, but rather more of a metric that provides a base for > comparison whenever we might have code changes or workloads that > potentially change this behavior. For example, consider a debug mode > stat or sysfs file that could be used to dump out a counter of "same > value" grant head samples after a benchmark workload or fstests run. The > fact that this value might go up or down is not necessarily a problem. > But that would provide some debug mode data to identify and analyze > potentially unintended side effects like transient spikes in concurrent > grant head checks due to blocking/scheduling changes in log reservation > tasks. > > See the appended RFC for a quick idea of what I mean. 
This is slightly > racy, but I think conservative enough to provide valid values with > respect to already racy reservation implementation. On my slow 4xcpu vm, > I see occasional sample counts of 2 running a -p8 fsstress. If I stick a > smallish delay in xfs_log_reserve(), the frequency of the output > increases and I see occasional bumps to 3, a spike of 8 in one case, > etc. Of course I'd probably factor the atomic calls into DEBUG only > inline functions that can be compiled out on production kernels and > replace the tracepoint with a global counter (exported via stats or > sysfs knob), but this just illustrates the idea. The global counter > could also be replaced with (or kept in addition to) something that > tracks a max concurrency value if that is more useful. Any thoughts on > something like this? > BTW, this doesn't necessarily have to be a task based counter. Another approach could be to keep a "lockless reservation slop" counter that is incremented by the ticket reservation size down in xlog_grant_head_check() (only in the lockless case) and flagged/decremented once the reservation is added to the grant head. Somewhere between the atomic add/sub we check the max slop value we've encountered and either track it and dump it in a sysfs file and/or just assert that it never reaches something unexpected/outrageous (based on some % of the log size, for example). Brian > > > > Hence occasional overruns on the reservation head before they start > > > > blocking isn't really a problem in practice because the probability > > > > of all the transaction reservation of all transactions running being > > > > required to make forwards progress is extremely small. > > > > > > > > Basically, we gave up "perfect reservation space grant accounting" > > > > because performance was extremely important and risk of log hangs as > > > > a result of overruns was considered to be extremely low and worth > > > > taking for the benefits the algorithm provided. 
This was just a > > > > simple, pragmatic risk based engineering decision. > > > > > > > > > > FWIW, the comment for xlog_verify_tail() also suggests the potential for > > > false positives and references a panic tag, which all seems kind of > > > erratic and misleading compared to what you explain here. > > > > Well, it's fundamentally an unserialised check, so it can race with > > other reservation grants, commits that release grant space and tail > > lsn updates. Hence it's not a 100% reliable debug check. > > > > Right. Yet there is a panic knob... > > > It also used to be run at all times, not just under > > XFS_CONFIG_DEBUG=y, which is why it has a panic tag associated with > > it. When we first deployed it, we weren't 100% sure there weren't > > customer workloads that would trip over this and hang the log, so > > we gave ourselves a way of triggering kernel dumps the instant an > > overrun was detected. Hence a site that had log hangs with this > > message in the logs could turn on the panic tag and we'd get a > > kernel dump to analyse... > > > > Ok, makes sense. This kind of speaks to the earlier point around having > more useful data. While this isn't necessarily a problem right now, we > have no real data to tell us whether some particular code change alters > this behavior. If this was enough of a concern when the change was first > put in place to insert a panic hook, then it stands to reason it's > something we should at least be cognizant of when making changes that > could impact its behavior. > > > Since then, this code has been relegated to debug code but the panic > > tag still exists. It could be turned back into a ASSERT now, but > > it's still useful the way it is as it means debug kernels don't fall > > over the moment a spurious overrun occurs... > > > > Yeah, ISTM the panic bit could be removed at this point. 
The warning (as > opposed to an assert) is probably reasonable so long as the check itself > is racy so as to not cause false positive panics with fatal asserts > enabled. > > Brian > > --- 8< --- > > RFC: crude concurrency stat on reserve grant head > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index a2beee9f74da..1d3056176e6e 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -175,6 +175,7 @@ xlog_grant_head_init( > xlog_assign_grant_head(&head->grant, 1, 0); > INIT_LIST_HEAD(&head->waiters); > spin_lock_init(&head->lock); > + atomic64_set(&head->sample_cnt, 0); > } > > STATIC void > @@ -446,6 +447,7 @@ xfs_log_reserve( > struct xlog_ticket *tic; > int need_bytes; > int error = 0; > + int64_t sample_cnt; > > ASSERT(client == XFS_TRANSACTION || client == XFS_LOG); > > @@ -465,13 +467,19 @@ xfs_log_reserve( > > error = xlog_grant_head_check(log, &log->l_reserve_head, tic, > &need_bytes); > + atomic64_inc(&log->l_reserve_head.sample_cnt); > if (error) > goto out_error; > > + sample_cnt = atomic64_xchg(&log->l_reserve_head.sample_cnt, 0); > xlog_grant_add_space(log, &log->l_reserve_head.grant, need_bytes); > xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes); > trace_xfs_log_reserve_exit(log, tic); > xlog_verify_grant_tail(log); > + > + if (sample_cnt > 1) > + trace_printk("%d: %lld\n", __LINE__, sample_cnt); > + > return 0; > > out_error: > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index b880c23cb6e4..62e4949f91c4 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -339,6 +339,7 @@ struct xlog_grant_head { > spinlock_t lock ____cacheline_aligned_in_smp; > struct list_head waiters; > atomic64_t grant; > + atomic64_t sample_cnt; > }; > > /* ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-09-30 21:53 ` Dave Chinner 2019-10-01 3:42 ` Dave Chinner @ 2019-10-01 13:13 ` Brian Foster 2019-10-01 22:31 ` Dave Chinner 1 sibling, 1 reply; 23+ messages in thread From: Brian Foster @ 2019-10-01 13:13 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote: > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote: > > On Mon, Sep 30, 2019 at 04:03:44PM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > In certain situations the background CIL push can be indefinitely > > > delayed. While we have workarounds from the obvious cases now, it > > > doesn't solve the underlying issue. This issue is that there is no > > > upper limit on the CIL where we will either force or wait for > > > a background push to start, hence allowing the CIL to grow without > > > bound until it consumes all log space. > > > > > > To fix this, add a new wait queue to the CIL which allows background > > > pushes to wait for the CIL context to be switched out. This happens > > > when the push starts, so it will allow us to block incoming > > > transaction commit completion until the push has started. This will > > > only affect processes that are running modifications, and only when > > > the CIL threshold has been significantly overrun. > > > > > > This has no apparent impact on performance, and doesn't even trigger > > > until over 45 million inodes had been created in a 16-way fsmark > > > test on a 2GB log. That was limiting at 64MB of log space used, so > > > the active CIL size is only about 3% of the total log in that case. > > > The concurrent removal of those files did not trigger the background > > > sleep at all. > > > > > > > Have you done similar testing for small/minimum sized logs? > > Yes. 
I've had the tracepoint active during xfstests runs on test > filesystems using default log sizes on 5-15GB filesystems. The only > test in all of xfstests that has triggered it is generic/017, and it > only triggered once. > Ok, interesting. I guess it would be nice if we had a test that somehow or another more effectively exercised this codepath. > e.g. > > # trace-cmd start -e xfs_log_cil_wait > <run xfstests> > # trace-cmd show > # tracer: nop > # > # entries-in-buffer/entries-written: 1/1 #P:4 > # > # _-----=> irqs-off > # / _----=> need-resched > # | / _---=> hardirq/softirq > # || / _--=> preempt-depth > # ||| / delay > # TASK-PID CPU# |||| TIMESTAMP FUNCTION > # | | | |||| | | > xfs_io-2158 [001] ...1 309.285959: xfs_log_cil_wait: dev 8:96 t_ocnt 1 t_cnt 1 t_curr_res 67956 t_unit_res 67956 t_flags XLOG_TIC_INITED reserveq empty writeq empty grant_reserve_cycle 75 grant_reserve_bytes 12878480 grant_write_cycle 75 grant_write_bytes 12878480 curr_cycle 75 curr_block 10448 tail_cycle 75 tail_block 3560 > # > > And the timestamp matched the time that generic/017 was running. > > > Also, if this is so limited in occurrence, had you given any thought to > > something even more simple like flushing the CIL push workqueue when > > over the throttle threshold? > > Yes, I've tried that - flush_workqueue() blocks /everything/ until > all the queued work is complete. That means it waits for the CIL to > flush to disk and write a commit record, and every modification in > the filesystem is stopped dead in its tracks until the CIL push is > complete. > > The problem is that flushing workqueues is a synchronous operation, > and it can't wait for partial work completion. We only need to wait > for the CIL context to be swapped out - this is done by the push > code before it starts all the expensive iclog formatting and waiting > for iclog space... > I know it waits on the work to complete.
I poked around a bit for an interface to "kick" a workqueue, so to speak (as opposed to flush), but I didn't see anything (not surprisingly, since it probably wouldn't be a broadly useful mechanism). That aside, where would this wait on the CIL to flush to disk? AFAICT the only thing that might happen here is log buffer submission. That only happens when the log buffer is full (regardless of the current CIL push writing its commit record). In fact if we did wait on I/O anywhere in here, then that suggests potential problems with async log force. > > That would wait longer to fill the iclogs, > > but would eliminate the need for another waitqueue that's apparently > > only used for throttling. It may also facilitate reuse of > > xlog_cil_push_now() in the >= XLOG_CIL_SPACE_BLOCKING_LIMIT() case > > (actually, I think this could facilitate the elimination of > > xlog_cil_push_background() entirely). > > xlog_cil_push_now() uses flush_work() to push any pending work > before it queues up the CIL flush that the caller is about to wait > for. i.e. the callers of xlog_cil_push_now() must ensure that all > CIL contexts are flushed for the purposes of a log force as they are > going to wait for all pending CIL flushes to complete. If we've > already pushed the CIL to the sequence that we are asking to push > to, we still have to wait for that previous push to be > done. This is what the flush_work() call in xlog_cil_push_now() > achieves. > Yes, I'm just exploring potential to reuse this code.. > > That aside...
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > fs/xfs/xfs_log_cil.c | 37 +++++++++++++++++++++++++++++++++---- > > > fs/xfs/xfs_log_priv.h | 24 ++++++++++++++++++++++++ > > > fs/xfs/xfs_trace.h | 1 + > > > 3 files changed, 58 insertions(+), 4 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > > index ef652abd112c..4a09d50e1368 100644 > > > --- a/fs/xfs/xfs_log_cil.c > > > +++ b/fs/xfs/xfs_log_cil.c > > > @@ -670,6 +670,11 @@ xlog_cil_push( > > > push_seq = cil->xc_push_seq; > > > ASSERT(push_seq <= ctx->sequence); > > > > > > + /* > > > + * Wake up any background push waiters now this context is being pushed. > > > + */ > > > + wake_up_all(&ctx->push_wait); > > > + > > > /* > > > * Check if we've anything to push. If there is nothing, then we don't > > > * move on to a new sequence number and so we have to be able to push > > > @@ -746,6 +751,7 @@ xlog_cil_push( > > > */ > > > INIT_LIST_HEAD(&new_ctx->committing); > > > INIT_LIST_HEAD(&new_ctx->busy_extents); > > > + init_waitqueue_head(&new_ctx->push_wait); > > > new_ctx->sequence = ctx->sequence + 1; > > > new_ctx->cil = cil; > > > cil->xc_ctx = new_ctx; > > > @@ -900,7 +906,7 @@ xlog_cil_push_work( > > > */ > > > static void > > > xlog_cil_push_background( > > > - struct xlog *log) > > > + struct xlog *log) __releases(cil->xc_ctx_lock) > > > { > > > struct xfs_cil *cil = log->l_cilp; > > > > > > @@ -914,14 +920,36 @@ xlog_cil_push_background( > > > * don't do a background push if we haven't used up all the > > > * space available yet. 
> > > */ > > > - if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) > > > + if (cil->xc_ctx->space_used < XLOG_CIL_SPACE_LIMIT(log)) { > > > + up_read(&cil->xc_ctx_lock); > > > return; > > > + } > > > > > > spin_lock(&cil->xc_push_lock); > > > if (cil->xc_push_seq < cil->xc_current_sequence) { > > > cil->xc_push_seq = cil->xc_current_sequence; > > > queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); > > > } > > > + > > > + /* > > > + * Drop the context lock now, we can't hold that if we need to sleep > > > + * because we are over the blocking threshold. The push_lock is still > > > + * held, so blocking threshold sleep/wakeup is still correctly > > > + * serialised here. > > > + */ > > > + up_read(&cil->xc_ctx_lock); > > > + > > > + /* > > > + * If we are well over the space limit, throttle the work that is being > > > + * done until the push work on this context has begun. > > > + */ > > > + if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { > > > > Can we use consistent logic with the rest of the function? I.e. > > > > up_read(..); > > if (space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { > > spin_unlock(..); > > return; > > } > > I did it this way because code inside the if() statement is > typically pushed out of line by the compiler if it finishes with a > return statement (i.e. the compiler makes an "unlikely" static > prediction). Hence it's best to put the slow path code inside the > if(), not the fast path code. > It looks to me that occurs regardless of whether there's a return statement or not. AFAICT, this branching already happens in the fast exit path of this function as well as for the queue_work() branch. So if it matters that much, perhaps the rest of the function should change appropriately.. > I can change it, but then we're into ugly likely()/unlikely() macro > territory and I much prefer to structure the code so things like > that are completely unnecessary.
> Hasn't the [un]likely() stuff been considered bogus for a while now? Either way, we're already under spinlock at this point so this isn't a fast exit path. We're talking about branching to an exit path that still requires a function call (i.e. branch) to unlock the spinlock before we return. Combine that with the rest of the function and I'm having a hard time seeing how the factoring affects performance in practice. Maybe it does, but I guess I'd want to see some data to grok that the very least. > > > > ... > > xlog_wait(...); > > > > > + trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket); > > > + ASSERT(cil->xc_ctx->space_used < log->l_logsize); > > > + xlog_wait(&cil->xc_ctx->push_wait, &cil->xc_push_lock); > > > + return; > > > + } > > > + > > > > Also, I find it slightly annoying that even with all of this locking > > quirkiness we still manage to read ->space_used unprotected (now twice). > > IMO, the simple thing would be to let xlog_cil_insert_items() return the > > size immediately after the current transaction inserts and pass that > > into this function as a parameter. > > Yes, I noticed this and thought about it a bit, but it doesn't > really matter if the read is racy. It never really has mattered > because it's not a hard limit where accuracy is vitally important > for correct functioning. Locking and/or atomics are really only > necessary if we care about exact accuracy, but we care only about > avoiding fast path locking overhead as much as possible here. > > I'll deal with that in a separate patchset when I have some spare > time. > Fair enough, it's not technically a problem with this patch. > > > + * To prevent the CIL from overflowing upper commit size bounds, we introduce a > > > + * new threshold at which we block committing transactions until the background > > > + * CIL commit commences and switches to a new context. 
While this is not a hard > > > + * limit, it forces the process committing a transaction to the CIL to block and > > > + * yield the CPU, giving the CIL push work a chance to be scheduled and start > > > + * work. This prevents a process running lots of transactions from overfilling > > > + * the CIL because it is not yielding the CPU. We set the blocking limit at > > > + * twice the background push space threshold so we keep in line with the AIL > > > + * push thresholds. > > > + * > > > + * Note: this is not a -hard- limit as blocking is applied after the transaction > > > + * is inserted into the CIL and the push has been triggered. It is largely a > > > + * throttling mechanism that allows the CIL push to be scheduled and run. A hard > > > + * limit will be difficult to implement without introducing global serialisation > > > + * in the CIL commit fast path, and it's not at all clear that we actually need > > > + * such hard limits given the ~7 years we've run without a hard limit before > > > + * finding the first situation where a checkpoint size overflow actually > > > + * occurred. Hence the simple throttle, and an ASSERT check to tell us that > > > + * we've overrun the max size. > > > */ > > > > I appreciate the extra documentation here, but I think most of the > > second paragraph is better commit log description material than > > something worth lengthening this already huge comment for. I'd suggest > > something like the following, but feel free to rework of course: > > Darrick asked me to include some of the information I had in the > commit description directly in the comment, so I did exactly what he > asked.... > I don't see any reasoning about why this implementation was chosen over a hard limit in the commit log description (of this patch or the previous version). FWIW, I think it makes perfect sense to include it in the commit log description. 
> > * > > * Since the CIL push threshold only triggers a background push, a > > * second threshold triggers transaction commit blocking until the > > * background push initiates and switches the CIL context. The blocking > > * threshold is set to twice the background push threshold to keep in > > * line with AIL push thresholds. Note that this is not a CIL context > > * size limit. This is a throttle threshold to slow the growth of the > > * context and yield the CPU for a background push under overload > > * conditions. > > Sure, but that doesn't tell us *why* it is implemented as a throttle > threshold instead of a hard limit. This text tells us what the code > does, not why it was implemented that way. I'll let Darrick decide > on this. > I generally don't find value in code comments that explain why we didn't do X or Y in code that ultimately does Z (that kind of thing belongs in the commit log description IMO), but if somebody finds the content useful then it's not a big deal to me. Brian > Cheers, > > Dave. > > > > */ > > > > Brian > > > > > #define XLOG_CIL_SPACE_LIMIT(log) \ > > > min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4) > > > > > > +#define XLOG_CIL_BLOCKING_SPACE_LIMIT(log) \ > > > + (XLOG_CIL_SPACE_LIMIT(log) * 2) > > > + > > > /* > > > * ticket grant locks, queues and accounting have their own cachelines > > > * as these are quite hot and can be operated on concurrently. 
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > > > index eaae275ed430..e7087ede2662 100644 > > > --- a/fs/xfs/xfs_trace.h > > > +++ b/fs/xfs/xfs_trace.h > > > @@ -1011,6 +1011,7 @@ DEFINE_LOGGRANT_EVENT(xfs_log_regrant_reserve_sub); > > > DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_enter); > > > DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_exit); > > > DEFINE_LOGGRANT_EVENT(xfs_log_ungrant_sub); > > > +DEFINE_LOGGRANT_EVENT(xfs_log_cil_wait); > > > > > > DECLARE_EVENT_CLASS(xfs_log_item_class, > > > TP_PROTO(struct xfs_log_item *lip), > > > -- > > > 2.23.0.rc1 > > > > > > > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-01 13:13 ` Brian Foster @ 2019-10-01 22:31 ` Dave Chinner 2019-10-02 12:40 ` Brian Foster 0 siblings, 1 reply; 23+ messages in thread From: Dave Chinner @ 2019-10-01 22:31 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Tue, Oct 01, 2019 at 09:13:04AM -0400, Brian Foster wrote: > On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote: > > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote: > > > Have you done similar testing for small/minimum sized logs? > > > > Yes. I've had the tracepoint active during xfstests runs on test > > filesystems using default log sizes on 5-15GB filesystems. The only > > test in all of xfstests that has triggered it is generic/017, and it > > only triggered once. > > > > Ok, interesting. I guess it would be nice if we had a test that somehow > or another more effectively exercised this codepath. *nod* But it's difficult to predict in any way because behaviour is not just a function of log size. :/ > > > Also, if this is so limited in occurrence, had you given any thought to > > > something even more simple like flushing the CIL push workqueue when > > > over the throttle threshold? > > > > Yes, I've tried that - flush_workqueue() blocks /everything/ until > > all the queued work is complete. That means it waits for the CIL to > > flush to disk and write a commit record, and every modification in > > the filesystem is stopped dead in its tracks until the CIL push is > > complete. > > > > The problem is that flushing workqueues is a synchronous operation, > > and it can't wait for partial work completion. We only need to wait > > for the CIL context to be swapped out - this is done by the push > > code before it starts all the expensive iclog formatting and waiting > > for iclog space... > > > > I know it waits on the work to complete. 
I poked around a bit for an > interface to "kick" a workqueue, so to speak (as opposed to flush), but > I didn't see anything (not surprisingly, since it probably wouldn't be a > broadly useful mechanism). *nod* > That aside, where would this wait on the CIL to flush to disk? AFAICT > the only thing that might happen here is log buffer submission. That > only happens when the log buffer is full (regardless of the current CIL > push writing its commit record). In fact if we did wait on I/O anywhere > in here, then that suggests potential problems with async log force. There is no such thing as an "async log force". The log force always waits on the CIL flush - XFS_LOG_SYNC only defines whether it waits on all iclogbufs post CIL flush to hit the disk. Further, when the CIL flushes, it's normally flushing more metadata than we can hold in 8x iclogbufs. The default is 32kB iclogbufs, so if we've got more than 256kB of checkpoint data to be written, then we end up waiting on iclogbuf completion to write more than 256kB. Typically I see a sustained ratio of roughly 4 iclog writes to 1 noiclogbufs in the metric graphs on small logs - just measured 700 log writes/s, 250 noiclogs/s for a 64MB log and 256kB logbuf size. IOWs, CIL flushes are regularly waiting in xlog_get_iclog_space() on iclog IO completion to occur... > > > That would wait longer to fill the iclogs, > > > but would eliminate the need for another waitqueue that's apparently > > > only used for throttling. It may also facilitate reuse of > > > xlog_cil_push_now() in the >= XLOG_CIL_SPACE_BLOCKING_LIMIT() case > > > (actually, I think this could facilitate the elimination of > > > xlog_cil_push_background() entirely). > > xlog_cil_push_now() uses flush_work() to push any pending work > > before it queues up the CIL flush that the caller is about to wait > > for. i.e. 
the callers of xlog_cil_push_now() must ensure that all > > CIL contexts are flushed for the purposes of a log force as they are > > going to wait for all pending CIL flushes to complete. If we've > > already pushed the CIL to the sequence that we are asking to push > > to, we still have to wait for that previous push to be > > done. This is what the flush_work() call in xlog_cil_push_now() > > achieves. > > > > Yes, I'm just exploring potential to reuse this code.. Yeah, I have a few prototype patches for revamping this, including an actual async CIL flush. I did some work here, but it didn't solve any of the problems I needed to fix, so I put it aside. See below. > > > > + /* > > > > + * If we are well over the space limit, throttle the work that is being > > > > + * done until the push work on this context has begun. > > > > + */ > > > > + if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { > > > > > > Can we use consistent logic with the rest of the function? I.e. > > > > > > up_read(..); > > > if (space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { > > > spin_unlock(..); > > > return; > > > } > > > > I did it this way because code inside the if() statement is > > typically pushed out of line by the compiler if it finishes with a > > return statement (i.e. the compiler makes an "unlikely" static > > prediction). Hence it's best to put the slow path code inside the > > if(), not the fast path code. > > It looks to me that occurs regardless of whether there's a return > statement or not. AFAICT, this branching already happens in the fast exit > path of this function as well as for the queue_work() branch. So if it > matters that much, perhaps the rest of the function should change > appropriately.. > > I can change it, but then we're into ugly likely()/unlikely() macro > > territory and I much prefer to structure the code so things like > > that are completely unnecessary. 
> > > > Hasn't the [un]likely() stuff been considered bogus for a while now? Yes, mainly because they are often wrong and often the compiler and hardware do a much better job with static and/or dynamic prediction. Nobody ever profiles the branch predictions to check the manual annotations are correct and valid, and code can change over time meaning the static prediction is wrong... > Either way, we're already under spinlock at this point so this isn't a > fast exit path. We're talking about branching to an exit path that > still requires a function call (i.e. branch) to unlock the spinlock > before we return. Combine that with the rest of the function and I'm > having a hard time seeing how the factoring affects performance in > practice. Maybe it does, but I guess I'd want to see some data to grok > that at the very least. I'm not wedded to a specific layout, just explaining the reasoning. I'll change it for the next revision... Cheers, Dave. -- Dave Chinner david@fromorbit.com xfs: Don't block tail pushing on CIL forces From: Dave Chinner <dchinner@redhat.com> When AIL tail pushing encounters a pinned item, it forces the log and waits for it to complete before it starts walking items again. However, in forcing the log it waits for all the dirty objects in memory to be written to the log, hence waiting for (potentially) a lot of log IO to complete when the CIL contains thousands of dirty objects. This can block processing of the AIL for seconds and prevent the tail of the log from being moved forward while the AIL is blocked. To prevent the log tail pushing from waiting on log flushes, just trigger a CIL push from the xfsaild instead of a full log force. This will get the majority of the items pinned moving via the CIL workqueue and so won't cause the xfsaild to wait for it to complete before it can do its own work. 
Further, the AIL has its own backoff waits, so move the CIL push to before the AIL backoff wait rather than afterwards, so the CIL has time to do some work before the AIL will start scanning again. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_log_cil.c | 13 ++++++++----- fs/xfs/xfs_log_priv.h | 22 ++++++++++++++++++---- fs/xfs/xfs_trans_ail.c | 29 +++++++++++++++-------------- 3 files changed, 41 insertions(+), 23 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 2f4cfd5b707e..a713f58e9e86 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -931,7 +931,7 @@ xlog_cil_push_background( * @push_seq, but it won't be completed. The caller is expected to do any * waiting for push_seq to complete if it is required. */ -static void +void xlog_cil_push_now( struct xlog *log, xfs_lsn_t push_seq) @@ -941,21 +941,24 @@ xlog_cil_push_now( if (!cil) return; - ASSERT(push_seq && push_seq <= cil->xc_current_sequence); - /* * If the CIL is empty or we've already pushed the sequence then - * there's no work we need to do. + * there's no work we need to do. If push_seq is not set, someone is + * just giving us the hurry up (e.g. to unpin objects without otherwise + * waiting on completion) so just push the current context. 
*/ spin_lock(&cil->xc_push_lock); + if (!push_seq) + push_seq = cil->xc_current_sequence; + if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) { spin_unlock(&cil->xc_push_lock); return; } cil->xc_push_seq = push_seq; - spin_unlock(&cil->xc_push_lock); queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); + spin_unlock(&cil->xc_push_lock); } bool diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index f66043853310..7842f923b3a1 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -520,11 +520,25 @@ bool xlog_cil_empty(struct xlog *log); /* * CIL force routines + * + * If you just want to schedule a CIL context flush to get things moving but + * not wait for anything (not an integrity operation!), use + * + * xlog_cil_push_now(log, 0); + * + * Blocking or integrity based CIL forces should not be called directly, but + * instead be marshalled through xfs_log_force(). i.e. if you want to: + * + * - flush and wait for the entire CIL to be submitted to the log, use + * xfs_log_force(mp, 0); + * + * - flush and wait for the CIL to be stable on disk and all items to be + * inserted into the AIL, use + * xfs_log_force(mp, XFS_LOG_SYNC); */ -xfs_lsn_t -xlog_cil_force_lsn( - struct xlog *log, - xfs_lsn_t sequence); +void xlog_cil_push_now(struct xlog *log, xfs_lsn_t sequence); + +xfs_lsn_t xlog_cil_force_lsn(struct xlog *log, xfs_lsn_t sequence); static inline void xlog_cil_force(struct xlog *log) diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index 5802139f786b..c9d8d05c3594 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -13,6 +13,7 @@ #include "xfs_mount.h" #include "xfs_trans.h" #include "xfs_trans_priv.h" +#include "xfs_log_priv.h" #include "xfs_trace.h" #include "xfs_errortag.h" #include "xfs_error.h" @@ -381,20 +382,6 @@ xfsaild_push( int flushing = 0; int count = 0; - /* - * If we encountered pinned items or did not finish writing out all - * buffers the last time we ran, force the log first and 
wait for it - * before pushing again. - */ - if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 && - (!list_empty_careful(&ailp->ail_buf_list) || - xfs_ail_min_lsn(ailp))) { - ailp->ail_log_flush = 0; - - XFS_STATS_INC(mp, xs_push_ail_flush); - xfs_log_force(mp, XFS_LOG_SYNC); - } - spin_lock(&ailp->ail_lock); /* barrier matches the ail_target update in xfs_ail_push() */ @@ -528,6 +515,20 @@ xfsaild_push( tout = 10; } + /* + * If we encountered pinned items or did not finish writing out all + * buffers the last time we ran, give the CIL a nudge to start + * unpinning the blocked items while we wait for a while... + */ + if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 && + (!list_empty_careful(&ailp->ail_buf_list) || + xfs_ail_min_lsn(ailp))) { + ailp->ail_log_flush = 0; + + XFS_STATS_INC(mp, xs_push_ail_flush); + xlog_cil_push_now(mp->m_log, 0); + } + return tout; } ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-01 22:31 ` Dave Chinner @ 2019-10-02 12:40 ` Brian Foster 2019-10-03 0:53 ` Dave Chinner 0 siblings, 1 reply; 23+ messages in thread From: Brian Foster @ 2019-10-02 12:40 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Wed, Oct 02, 2019 at 08:31:07AM +1000, Dave Chinner wrote: > On Tue, Oct 01, 2019 at 09:13:04AM -0400, Brian Foster wrote: > > On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote: > > > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote: > > > > Have you done similar testing for small/minimum sized logs? > > > > > > Yes. I've had the tracepoint active during xfstests runs on test > > > filesystems using default log sizes on 5-15GB filesystems. The only > > > test in all of xfstests that has triggered it is generic/017, and it > > > only triggered once. > > > > > > > Ok, interesting. I guess it would be nice if we had a test that somehow > > or another more effectively exercised this codepath. > > *nod* > > But it's essentially difficult to predict in any way because > behaviour is not just a function of log size. :/ > > > > > Also, if this is so limited in occurrence, had you given any thought to > > > > something even more simple like flushing the CIL push workqueue when > > > > over the throttle threshold? > > > > > > Yes, I've tried that - flush_workqueue() blocks /everything/ until > > > all the queued work is complete. That means it waits for the CIL to > > > flush to disk and write a commit record, and every modification in > > > the filesystem is stopped dead in it's tracks until the CIL push is > > > complete. > > > > > > The problem is that flushing workqueues is a synchronous operation, > > > and it can't wait for partial work completion. We only need to wait > > > for the CIL context to be swapped out - this is done by the push > > > code before it starts all the expensive iclog formating and waiting > > > for iclog space... 
> > > > > > > I know it waits on the work to complete. I poked around a bit for an > > interface to "kick" a workqueue, so to speak (as opposed to flush), but > > I didn't see anything (not surprisingly, since it probably wouldn't be a > > broadly useful mechanism). > > *nod* > > > That aside, where would this wait on the CIL to flush to disk? AFAICT > > the only thing that might happen here is log buffer submission. That > > only happens when the log buffer is full (regardless of the current CIL > > push writing its commit record). In fact if we did wait on I/O anywhere > > in here, then that suggests potential problems with async log force. > > There is no such thing as a "async log force". The log force always > waits on the CIL flush - XFS_LOG_SYNC only defines whether it waits > on all iclogbufs post CIL flush to hit the disk. > I'm just referring to the semantics/naming of the existing interface. I suppose I could have used "a log force that doesn't wait on all iclogbufs to hit the disk," but that doesn't quite roll off the tongue. ;) > Further, when the CIL flushes, it's normally flushing more metadata that we > can hold in 8x iclogbufs. The default is 32kB iclogbufs, so if we've > got more than 256kB of checkpoint data to be written, then we end up > waiting on iclogbuf completion to write more then 256kB. > > Typically I see a sustainted ratio of roughly 4 iclog writes to 1 > noiclogbufs in the metric graphs on small logs - just measured 700 > log writes/s, 250 noiclogs/s for a 64MB log and 256kB logbuf size. > IOWs, CIL flushes are regularly waiting in xlog_get_iclog_space() on > iclog IO completion to occur... > Ok, that's not quite what I was concerned about when you mentioned waiting on the CIL to flush to disk. No matter, the important bit here is the performance cost of including the extra blocking on log I/O (to cycle iclogs back to active for reuse) in the throttle. 
I'm curious about how noticeable this extra blocking would be because it's one likely cause of the CIL pressure buildup in the first place. My original tests reproduced huge CIL checkpoints purely based on one CIL push being blocked behind the processing of another, the latter taking relatively more time due to log I/O. This is not to say there aren't other causes of excessively sized checkpoints. Rather, if we're at a point where we've blocked transactions on this new threshold, that means we've doubled the background threshold in the time between when we first triggered a background CIL push and when the push actually started. From that, it seems fairly likely that we could replenish the CIL to the background threshold once threads are unblocked but before the previous push completes. The question is: can we get all the way to the blocking threshold before that happens? That doesn't seem unrealistic to me, but it's hard to reason about without having tested it. If so, I think it means we end up blocking on completion of the first push to some degree anyway. > > > > That would wait longer to fill the iclogs, > > > > but would eliminate the need for another waitqueue that's apparently > > > > only used for throttling. It may also facilitate reuse of > > > > xlog_cil_push_now() in the >= XLOG_CIL_SPACE_BLOCKING_LIMIT() case > > > > (actually, I think this could facilitate the elimination of > > > > xlog_cil_push_background() entirely). > > > > > > xlog_cil_push_now() uses flush_work() to push any pending work > > > before it queues up the CIL flush that the caller is about to wait > > > for. i.e. the callers of xlog_cil_push_now() must ensure that all > > > CIL contexts are flushed for the purposes of a log force as they are > > > going to wait for all pending CIL flushes to complete. If we've > > > already pushed the CIL to the sequence that we are asking to push > > > to, we still have to wait for that previous push to be > > > done. 
This is what the flush_work() call in xlog_cil_push_now() > > > acheives. > > > > > > > Yes, I'm just exploring potential to reuse this code.. > > Yeah, I have a few prototype patches for revamping this, including > an actual async CIL flush. I do some work here, but it didn't solve > any of the problems I needed to fix so it put it aside. See below. > That sounds more involved than what I was thinking. My thought is that this throttle is already not predictable or deterministic (i.e. we're essentially waiting on a scheduler event) and so might not require the extra complexity of a new waitqueue. It certainly could be the case that blocking on the entire push is just too long in practice, but since this is already based on empirical evidence and subject to unpredictability, ISTM that testing is the only way to know for sure. For reference, I hacked up something to reuse xlog_cil_push_now() for background pushing and throttling that ends up removing 20 or so lines of code by the time it's in place, but I haven't given it any testing. That said, this is just an observation and an idea. I'm fine with the proposed implementation with the other nits and whatnot fixed up. Brian > > > > > + /* > > > > > + * If we are well over the space limit, throttle the work that is being > > > > > + * done until the push work on this context has begun. > > > > > + */ > > > > > + if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { > > > > > > > > Can we use consistent logic with the rest of the function? I.e. > > > > > > > > up_read(..); > > > > if (space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) { > > > > spin_unlock(..); > > > > return; > > > > } > > > > > > I did it this way because code inside the if() statement is > > > typically pushed out of line by the compiler if it finished with a > > > return statement (i.e. the compiler makes an "unlikely" static > > > prediction). Hence it's best to put the slow path code inside thei > > > if(), not the fast path code. 
> > > > > > > It looks to me that occurs regardless of whether there's a return > > statement or not. AFAICT, this branching aready happens in the fast exit > > path of this function as well as for the queue_work() branch. So if it > > matters that much, perhaps the rest of the function should change > > appropriately.. > > > > > I can change it, but then we're into ugly likely()/unlikely() macro > > > territory and I much prefer to structure the code so things like > > > that are completely unnecessary. > > > > > > > Hasn't the [un]likely() stuff been considered bogus for a while now? > > Yes, mainly because they are often wrongi and often the compiler and > hardware do a much better job with static and/or dynamic prediction. > Nobody ever profiles the branch predictions to check the manual > annotations are correct and valid, and code can change over time > meaning the static prediction is wrong... > > > Either way, we're already under spinlock at this point so this isn't a > > fast exit path. We're talking about branching to an exit path that > > still requires a function call (i.e. branch) to unlock the spinlock > > before we return. Combine that with the rest of the function and I'm > > having a hard time seeing how the factoring affects performance in > > practice. Maybe it does, but I guess I'd want to see some data to grok > > that the very least. > > I'm not wedded to a specific layout, just explaining the reasoning. > I'll change it for the next revision... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > xfs: Don't block tail pushing on CIL forces > > From: Dave Chinner <dchinner@redhat.com> > > When AIL tail pushing encounters a pinned item, it forces the log > and waits for it to complete before it starts walking items again. 
> However, in forcing the log it waits for all the dirty objects in > memory to be written to the log, hence waiting for (potentially) a > lot of log IO to complete when the CIL contains thousands of dirty > objects. This can block processing of the AIL for seconds and > prevent the tail of the log from being moved forward while the AIL > is blocked. > > To prevent the log tail pushing from waiting on log flushes, just > trigger a CIL push from the xfsaild instead of a full log force. > This will get the majority of the items pinned moving via teh CIL > workqueue and so won't cause the xfsaild to wait for it to complete > before it can do it's own work. > > Further, the AIL has it's own backoff waits, so move the CIL push > to before the AIL backoff wait rather than afterwards, so the CIL > has time to do some work before the AIL will start scanning again. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_log_cil.c | 13 ++++++++----- > fs/xfs/xfs_log_priv.h | 22 ++++++++++++++++++---- > fs/xfs/xfs_trans_ail.c | 29 +++++++++++++++-------------- > 3 files changed, 41 insertions(+), 23 deletions(-) > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index 2f4cfd5b707e..a713f58e9e86 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -931,7 +931,7 @@ xlog_cil_push_background( > * @push_seq, but it won't be completed. The caller is expected to do any > * waiting for push_seq to complete if it is required. > */ > -static void > +void > xlog_cil_push_now( > struct xlog *log, > xfs_lsn_t push_seq) > @@ -941,21 +941,24 @@ xlog_cil_push_now( > if (!cil) > return; > > - ASSERT(push_seq && push_seq <= cil->xc_current_sequence); > - > /* > * If the CIL is empty or we've already pushed the sequence then > - * there's no work we need to do. > + * there's no work we need to do. If push_seq is not set, someone is > + * just giving us the hurry up (e.g. 
to unpin objects without otherwise > + * waiting on completion) so just push the current context. > */ > spin_lock(&cil->xc_push_lock); > + if (!push_seq) > + push_seq = cil->xc_current_sequence; > + > if (list_empty(&cil->xc_cil) || push_seq <= cil->xc_push_seq) { > spin_unlock(&cil->xc_push_lock); > return; > } > > cil->xc_push_seq = push_seq; > - spin_unlock(&cil->xc_push_lock); > queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); > + spin_unlock(&cil->xc_push_lock); > } > > bool > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index f66043853310..7842f923b3a1 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -520,11 +520,25 @@ bool xlog_cil_empty(struct xlog *log); > > /* > * CIL force routines > + * > + * If you just want to schedule a CIL context flush to get things moving but > + * not wait for anything (not an integrity operation!), use > + * > + * xlog_cil_push_now(log, 0); > + * > + * Blocking or integrity based CIL forces should not be called directly, but > + * instead be marshalled through xfs_log_force(). i.e. 
if you want to: > + * > + * - flush and wait for the entire CIL to be submitted to the log, use > + * xfs_log_force(mp, 0); > + * > + * - flush and wait for the CIL to be stable on disk and all items to be > + * inserted into the AIL, use > + * xfs_log_force(mp, XFS_LOG_SYNC); > */ > -xfs_lsn_t > -xlog_cil_force_lsn( > - struct xlog *log, > - xfs_lsn_t sequence); > +void xlog_cil_push_now(struct xlog *log, xfs_lsn_t sequence); > + > +xfs_lsn_t xlog_cil_force_lsn(struct xlog *log, xfs_lsn_t sequence); > > static inline void > xlog_cil_force(struct xlog *log) > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c > index 5802139f786b..c9d8d05c3594 100644 > --- a/fs/xfs/xfs_trans_ail.c > +++ b/fs/xfs/xfs_trans_ail.c > @@ -13,6 +13,7 @@ > #include "xfs_mount.h" > #include "xfs_trans.h" > #include "xfs_trans_priv.h" > +#include "xfs_log_priv.h" > #include "xfs_trace.h" > #include "xfs_errortag.h" > #include "xfs_error.h" > @@ -381,20 +382,6 @@ xfsaild_push( > int flushing = 0; > int count = 0; > > - /* > - * If we encountered pinned items or did not finish writing out all > - * buffers the last time we ran, force the log first and wait for it > - * before pushing again. > - */ > - if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 && > - (!list_empty_careful(&ailp->ail_buf_list) || > - xfs_ail_min_lsn(ailp))) { > - ailp->ail_log_flush = 0; > - > - XFS_STATS_INC(mp, xs_push_ail_flush); > - xfs_log_force(mp, XFS_LOG_SYNC); > - } > - > spin_lock(&ailp->ail_lock); > > /* barrier matches the ail_target update in xfs_ail_push() */ > @@ -528,6 +515,20 @@ xfsaild_push( > tout = 10; > } > > + /* > + * If we encountered pinned items or did not finish writing out all > + * buffers the last time we ran, give the CIL a nudge to start > + * unpinning the blocked items while we wait for a while... 
> + */ > + if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 && > + (!list_empty_careful(&ailp->ail_buf_list) || > + xfs_ail_min_lsn(ailp))) { > + ailp->ail_log_flush = 0; > + > + XFS_STATS_INC(mp, xs_push_ail_flush); > + xlog_cil_push_now(mp->m_log, 0); > + } > + > return tout; > } > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-02 12:40 ` Brian Foster @ 2019-10-03 0:53 ` Dave Chinner 2019-10-03 14:39 ` Brian Foster 0 siblings, 1 reply; 23+ messages in thread From: Dave Chinner @ 2019-10-03 0:53 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Wed, Oct 02, 2019 at 08:40:56AM -0400, Brian Foster wrote: > On Wed, Oct 02, 2019 at 08:31:07AM +1000, Dave Chinner wrote: > > On Tue, Oct 01, 2019 at 09:13:04AM -0400, Brian Foster wrote: > > > On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote: > > > > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote: > > > > > Have you done similar testing for small/minimum sized logs? > > > > > > > > Yes. I've had the tracepoint active during xfstests runs on test > > > > filesystems using default log sizes on 5-15GB filesystems. The only > > > > test in all of xfstests that has triggered it is generic/017, and it > > > > only triggered once. > > > > > > > > > > Ok, interesting. I guess it would be nice if we had a test that somehow > > > or another more effectively exercised this codepath. > > > > *nod* > > > > But it's essentially difficult to predict in any way because > > behaviour is not just a function of log size. :/ > > > > > > > Also, if this is so limited in occurrence, had you given any thought to > > > > > something even more simple like flushing the CIL push workqueue when > > > > > over the throttle threshold? > > > > > > > > Yes, I've tried that - flush_workqueue() blocks /everything/ until > > > > all the queued work is complete. That means it waits for the CIL to > > > > flush to disk and write a commit record, and every modification in > > > > the filesystem is stopped dead in it's tracks until the CIL push is > > > > complete. > > > > > > > > The problem is that flushing workqueues is a synchronous operation, > > > > and it can't wait for partial work completion. 
We only need to wait > > > > for the CIL context to be swapped out - this is done by the push > > > > code before it starts all the expensive iclog formatting and waiting > > > > for iclog space... > > > > > > > > > > I know it waits on the work to complete. I poked around a bit for an > > > interface to "kick" a workqueue, so to speak (as opposed to flush), but > > > I didn't see anything (not surprisingly, since it probably wouldn't be a > > > broadly useful mechanism). > > > > *nod* > > > > > That aside, where would this wait on the CIL to flush to disk? AFAICT > > > the only thing that might happen here is log buffer submission. That > > > only happens when the log buffer is full (regardless of the current CIL > > > push writing its commit record). In fact if we did wait on I/O anywhere > > > in here, then that suggests potential problems with async log force. > > > > There is no such thing as an "async log force". The log force always > > waits on the CIL flush - XFS_LOG_SYNC only defines whether it waits > > on all iclogbufs post CIL flush to hit the disk. > > > > I'm just referring to the semantics/naming of the existing interface. I > suppose I could have used "a log force that doesn't wait on all > iclogbufs to hit the disk," but that doesn't quite roll off the tongue. > ;) *nod* Just wanted to make sure we were both talking about the same thing :) > > > Further, when the CIL flushes, it's normally flushing more metadata than we > > can hold in 8x iclogbufs. The default is 32kB iclogbufs, so if we've > > got more than 256kB of checkpoint data to be written, then we end up > > waiting on iclogbuf completion to write more than 256kB. > > > > Typically I see a sustained ratio of roughly 4 iclog writes to 1 > > noiclogbufs in the metric graphs on small logs - just measured 700 > > log writes/s, 250 noiclogs/s for a 64MB log and 256kB logbuf size. > > IOWs, CIL flushes are regularly waiting in xlog_get_iclog_space() on > > iclog IO completion to occur... 
> > > > Ok, that's not quite what I was concerned about when you mentioned > waiting on the CIL to flush to disk. No matter, the important bit here > is the performance cost of including the extra blocking on log I/O (to > cycle iclogs back to active for reuse) in the throttle. > > I'm curious about how noticeable this extra blocking would be because > it's one likely cause of the CIL pressure buildup in the first place. My > original tests reproduced huge CIL checkpoints purely based on one CIL > push being blocked behind the processing of another, the latter taking > relatively more time due to log I/O. It's very noticeable. I dropped that as a throttle mechanism because even on an SSD at 200us per write, a 32MB CIL flush takes 20-30ms. And it stops everything dead for that time because it stalls all async transaction commits while the log flushes. When several CIL flushes a second occur (pushing 100-150MB/s to the log), I start seeing a substantial amount of additional time (15-20% of total CPU time) being spent idle just waiting for CIL flush completion. And it only gets worse as the storage gets slower... The throttle that these patches implement is typically only stalling incoming transactions for a couple of hundred microseconds. The timestamps from the groups of log events show the blocking of all threads are very close together in time, and the push work to switch over to the new context to allow them to unblock and continue happens within another 200-300 microseconds. And in the worst case I've seen, this is happening once or twice a second. IOWs, the blocking time of the throttle is very short and largely unnoticeable, and the difference to performance it causes is far, far less than the noise threshold of the benchmarks. > This is not to say there aren't other causes of excessively sized > checkpoints. 
Rather, if we're at a point > where we've blocked > transactions on this new threshold, that means we've doubled the > background threshold in the time we've first triggered a background CIL > push and the push actually started. From that, it seems fairly likely > that we could replenish the CIL to the background threshold once > threads are unblocked but before the previous push completes. That's just fine. The CIL is actually designed to allow that sort of overlap between multiple committing background contexts and the current aggregating context. As long as each individual CIL commit size doesn't go over half the log and there's always a complete commit in the log (which there should be via head/tail limits), the only limit on the number of concurrent committing CIL contexts on the fly at once is the amount of log space we have available.... > The question is: can we get all the way to the blocking threshold before > that happens? That doesn't seem unrealistic to me, but it's hard to > reason about without having tested it. If so, I think it means we end up > blocking on completion of the first push to some degree anyways. Yes, we can fill and push the current sequence before the previous sequence has finished committing. > > > > xlog_cil_push_now() uses flush_work() to push any pending work > > > > before it queues up the CIL flush that the caller is about to wait > > > > for. i.e. the callers of xlog_cil_push_now() must ensure that all > > > > CIL contexts are flushed for the purposes of a log force as they are > > > > going to wait for all pending CIL flushes to complete. If we've > > > > already pushed the CIL to the sequence that we are asking to push > > > > to, we still have to wait for that previous push to be > > > > done. This is what the flush_work() call in xlog_cil_push_now() > > > > achieves. > > > > > > > > > > Yes, I'm just exploring potential to reuse this code.. 
> > > > Yeah, I have a few prototype patches for revamping this, including > > an actual async CIL flush. I did some work here, but it didn't solve > > any of the problems I needed to fix so I put it aside. See below. > > > > That sounds more involved than what I was thinking. My thought is that > this throttle is already not predictable or deterministic (i.e. we're > essentially waiting on a scheduler event) and so might not require the > extra complexity of a new waitqueue. It certainly could be the case that > blocking on the entire push is just too long in practice, but since this I've made that sort of change here and measured the regression. Then I discarded it as an unworkable option and looked for other solutions. > is already based on empirical evidence and subject to unpredictability, > ISTM that testing is the only way to know for sure. For reference, I > hacked up something to reuse xlog_cil_push_now() for background pushing > and throttling that ends up removing 20 or so lines of code by the time > it's in place, but I haven't given it any testing. > > That said, this is just an observation and an idea. I'm fine with the > proposed implementation with the other nits and whatnot fixed up. No worries. It's still worth exploring all the alternatives. :) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
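The throttle semantics discussed in the message above (incoming commits block only until the push work swaps in a new, empty CIL context, not until the flush reaches disk) can be sketched as a toy model. This is purely illustrative and not the kernel implementation; the class name and the constants (a 32MB soft limit, blocking at 2x that) are assumptions for the sketch:

```python
# Toy model of the two-level CIL throttle: a background push is queued
# past a soft space limit, and committers block past a hard limit until
# the push work swaps in an empty context.  Not the kernel code.
XLOG_CIL_SPACE_LIMIT = 32 * 1024 * 1024          # assumed soft limit
XLOG_CIL_BLOCKING_SPACE_LIMIT = 2 * XLOG_CIL_SPACE_LIMIT

class CilModel:
    def __init__(self):
        self.used_space = 0
        self.push_queued = False
        self.blocked_commits = 0

    def commit(self, nbytes):
        """Insert a transaction; returns True if the caller had to wait."""
        waited = False
        if self.used_space >= XLOG_CIL_BLOCKING_SPACE_LIMIT:
            # Committer sleeps only until the push work switches to a
            # new (empty) context -- typically a few hundred us -- not
            # until the old context is flushed to disk.
            self.blocked_commits += 1
            waited = True
            self.context_swap()
        self.used_space += nbytes
        if self.used_space >= XLOG_CIL_SPACE_LIMIT:
            self.push_queued = True   # background push scheduled
        return waited

    def context_swap(self):
        # The old context keeps flushing in the background; committers
        # proceed against the fresh one.
        self.used_space = 0
        self.push_queued = False

# With 1MB transactions, the first 64 commits never wait; the 65th
# blocks briefly on the context swap:
m = CilModel()
waits = [m.commit(1024 * 1024) for _ in range(65)]
print(waits.index(True))   # 64
```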
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-03 0:53 ` Dave Chinner @ 2019-10-03 14:39 ` Brian Foster 2019-10-08 3:34 ` Dave Chinner 0 siblings, 1 reply; 23+ messages in thread From: Brian Foster @ 2019-10-03 14:39 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Thu, Oct 03, 2019 at 10:53:10AM +1000, Dave Chinner wrote: > On Wed, Oct 02, 2019 at 08:40:56AM -0400, Brian Foster wrote: > > On Wed, Oct 02, 2019 at 08:31:07AM +1000, Dave Chinner wrote: > > > On Tue, Oct 01, 2019 at 09:13:04AM -0400, Brian Foster wrote: > > > > On Tue, Oct 01, 2019 at 07:53:36AM +1000, Dave Chinner wrote: > > > > > On Mon, Sep 30, 2019 at 01:03:58PM -0400, Brian Foster wrote: > > > > > > Have you done similar testing for small/minimum sized logs? > > > > > > > > > > Yes. I've had the tracepoint active during xfstests runs on test > > > > > filesystems using default log sizes on 5-15GB filesystems. The only > > > > > test in all of xfstests that has triggered it is generic/017, and it > > > > > only triggered once. > > > > > > > > > > > > > Ok, interesting. I guess it would be nice if we had a test that somehow > > > > or another more effectively exercised this codepath. > > > > > > *nod* > > > > > > But it's essentially difficult to predict in any way because > > > behaviour is not just a function of log size. :/ > > > > > > > > > Also, if this is so limited in occurrence, had you given any thought to > > > > > > something even more simple like flushing the CIL push workqueue when > > > > > > over the throttle threshold? > > > > > > > > > > Yes, I've tried that - flush_workqueue() blocks /everything/ until > > > > > all the queued work is complete. That means it waits for the CIL to > > > > > flush to disk and write a commit record, and every modification in > > > > > the filesystem is stopped dead in its tracks until the CIL push is > > > > > complete. 
> > > > > > > > > > The problem is that flushing workqueues is a synchronous operation, > > > > > and it can't wait for partial work completion. We only need to wait > > > > > for the CIL context to be swapped out - this is done by the push > > > > > code before it starts all the expensive iclog formatting and waiting > > > > > for iclog space... > > > > > > > > > > > > > I know it waits on the work to complete. I poked around a bit for an > > > > interface to "kick" a workqueue, so to speak (as opposed to flush), but > > > > I didn't see anything (not surprisingly, since it probably wouldn't be a > > > > broadly useful mechanism). > > > > > > *nod* > > > > > > > That aside, where would this wait on the CIL to flush to disk? AFAICT > > > > the only thing that might happen here is log buffer submission. That > > > > only happens when the log buffer is full (regardless of the current CIL > > > > push writing its commit record). In fact if we did wait on I/O anywhere > > > > in here, then that suggests potential problems with async log force. > > > > > > There is no such thing as an "async log force". The log force always > > > waits on the CIL flush - XFS_LOG_SYNC only defines whether it waits > > > on all iclogbufs post CIL flush to hit the disk. > > > > > > > I'm just referring to the semantics/naming of the existing interface. I > > suppose I could have used "a log force that doesn't wait on all > > iclogbufs to hit the disk," but that doesn't quite roll off the tongue. > > ;) > > *nod* > > Just wanted to make sure we were both talking about the same thing > :) > > > > > > Further, when the CIL flushes, it's normally flushing more metadata than we > > > can hold in 8x iclogbufs. The default is 32kB iclogbufs, so if we've > > > got more than 256kB of checkpoint data to be written, then we end up > > > waiting on iclogbuf completion to write more than 256kB. 
> > > > > > Typically I see a sustained ratio of roughly 4 iclog writes to 1 > > > noiclogbufs in the metric graphs on small logs - just measured 700 > > > log writes/s, 250 noiclogs/s for a 64MB log and 256kB logbuf size. > > > IOWs, CIL flushes are regularly waiting in xlog_get_iclog_space() on > > > iclog IO completion to occur... > > > > > > > Ok, that's not quite what I was concerned about when you mentioned > > waiting on the CIL to flush to disk. No matter, the important bit here > > is the performance cost of including the extra blocking on log I/O (to > > cycle iclogs back to active for reuse) in the throttle. > > > > I'm curious about how noticeable this extra blocking would be because > > it's one likely cause of the CIL pressure buildup in the first place. My > > original tests reproduced huge CIL checkpoints purely based on one CIL > > push being blocked behind the processing of another, the latter taking > > relatively more time due to log I/O. > > It's very noticeable. I dropped that as a throttle mechanism because > even on an SSD at 200us per write, a 32MB CIL flush takes > 20-30ms. And it stops everything dead for that time because it > stalls all async transaction commits while the log flushes. > > When several CIL flushes a second occur (pushing 100-150MB/s to the > log), I start seeing a substantial amount of additional time > (15-20% of total CPU time) being spent idle just waiting for CIL > flush completion. And it only gets worse as the storage gets > slower... > > The throttle that these patches implement is typically only > stalling incoming transactions for a couple of hundred microseconds. > The timestamps from the groups of log events show the blocking of > all threads are very close together in time, and the push work to > switch over to the new context to allow them to unblock and continue > happens within another 200-300 microseconds. And in the worst case I've > seen, this is happening once or twice a second. 
IOWs, the blocking > time of the throttle is very short and largely unnoticeable, and the > difference to performance it causes is far, far less than the noise > threshold of the benchmarks. > Ok, thanks for the info. Is that 200-300us delay reliable on a sustained workload (where the throttle triggers multiple times)? If so, I suppose that longer delay I was wondering about is not as likely in practice. > > This is not to say there aren't other causes of excessively sized > > checkpoints. Rather, if we're at a point where we've blocked > > transactions on this new threshold, that means we've doubled the > > background threshold in the time we've first triggered a background CIL > > push and the push actually started. From that, it seems fairly likely > > that we could replenish the CIL to the background threshold once > > threads are unblocked but before the previous push completes. > > That's just fine. The CIL is actually designed to allow that sort of > overlap between multiple committing background contexts and the > current aggregating context. As long as each individual CIL commit > size doesn't go over half the log and there's always a complete > commit in the log (which there should be via head/tail limits), > the only limit on the number of concurrent committing CIL contexts > on the fly at once is the amount of log space we have available.... > Right, but if we're blocking transactions on the start of a CIL push and > the start of a CIL push blocks on completion of the previous push, it's > possible to indirectly block transactions on completion of a CIL push. > There is some staggering there compared to blocking a transaction > directly on a CIL push start to finish, however, so perhaps that keeps > the delay down in practice. > > The question is: can we get all the way to the blocking threshold before > > that happens? That doesn't seem unrealistic to me, but it's hard to > > reason about without having tested it. 
If so, I think it means we end up > > blocking on completion of the first push to some degree anyways. > > Yes, we can fill and push the current sequence before the previous > sequence has finished committing. > > > > > > xlog_cil_push_now() uses flush_work() to push any pending work > > > > > before it queues up the CIL flush that the caller is about to wait > > > > > for. i.e. the callers of xlog_cil_push_now() must ensure that all > > > > > CIL contexts are flushed for the purposes of a log force as they are > > > > > going to wait for all pending CIL flushes to complete. If we've > > > > > already pushed the CIL to the sequence that we are asking to push > > > > > to, we still have to wait for that previous push to be > > > > > done. This is what the flush_work() call in xlog_cil_push_now() > > > > > achieves. > > > > > > > > > > > > > Yes, I'm just exploring potential to reuse this code.. > > > > > > Yeah, I have a few prototype patches for revamping this, including > > > an actual async CIL flush. I did some work here, but it didn't solve > > > any of the problems I needed to fix so I put it aside. See below. > > > > > > > That sounds more involved than what I was thinking. My thought is that > > this throttle is already not predictable or deterministic (i.e. we're > > essentially waiting on a scheduler event) and so might not require the > > extra complexity of a new waitqueue. It certainly could be the case that > > blocking on the entire push is just too long in practice, but since this > > I've made that sort of change here and measured the regression. > Then I discarded it as an unworkable option and looked for other > solutions. > Ok. What was the regression that resulted from the longer throttle delay, out of curiosity? Brian > > is already based on empirical evidence and subject to unpredictability, > > ISTM that testing is the only way to know for sure. 
For reference, I > > hacked up something to reuse xlog_cil_push_now() for background pushing > > and throttling that ends up removing 20 or so lines of code by the time > > it's in place, but I haven't given it any testing. > > > > That said, this is just an observation and an idea. I'm fine with the > > proposed implementation with the other nits and whatnot fixed up. > > No worries. It's still worth exploring all the alternatives. :) > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push 2019-10-03 14:39 ` Brian Foster @ 2019-10-08 3:34 ` Dave Chinner 0 siblings, 0 replies; 23+ messages in thread From: Dave Chinner @ 2019-10-08 3:34 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Thu, Oct 03, 2019 at 10:39:34AM -0400, Brian Foster wrote: > On Thu, Oct 03, 2019 at 10:53:10AM +1000, Dave Chinner wrote: > > On Wed, Oct 02, 2019 at 08:40:56AM -0400, Brian Foster wrote: > > > Ok, that's not quite what I was concerned about when you mentioned > > > waiting on the CIL to flush to disk. No matter, the important bit here > > > is the performance cost of including the extra blocking on log I/O (to > > > cycle iclogs back to active for reuse) in the throttle. > > > > > > I'm curious about how noticeable this extra blocking would be because > > > it's one likely cause of the CIL pressure buildup in the first place. My > > > original tests reproduced huge CIL checkpoints purely based on one CIL > > > push being blocked behind the processing of another, the latter taking > > > relatively more time due to log I/O. > > > > It's very noticeable. I dropped that as a throttle mechanism because > > even on an SSD at 200us per write, a 32MB CIL flush takes > > 20-30ms. And it stops everything dead for that time because it > > stalls all async transaction commits while the log flushes. > > > > When several CIL flushes a second occur (pushing 100-150MB/s to the > > log), I start seeing a substantial amount of additional time > > (15-20% of total CPU time) being spent idle just waiting for CIL > > flush completion. And it only gets worse as the storage gets > > slower... > > > > The throttle that these patches implement is typically only > > stalling incoming transactions for a couple of hundred microseconds. 
> > The timestamps from the groups of log events show the blocking of > > all threads are very close together in time, and the push work to > > switch over to the new context to allow them to unblock and continue > > happens within another 200-300 microseconds. And in the worst case I've > > seen, this is happening once or twice a second. IOWs, the blocking > > time of the throttle is very short and largely unnoticeable, and the > > difference to performance it causes is far, far less than the noise > > threshold of the benchmarks. > > > > Ok, thanks for the info. Is that 200-300us delay reliable on a sustained > workload (where the throttle triggers multiple times)? If so, I suppose > that longer delay I was wondering about is not as likely in practice. The more threads there are, the longer the capture and delay time, but I'm not seeing delays that are long enough to measurably reduce the CPU consumption of the workloads nor the throughput of the benchmark outside of the normal noise floor. > > > This is not to say there aren't other causes of excessively sized > > > checkpoints. Rather, if we're at a point where we've blocked > > > transactions on this new threshold, that means we've doubled the > > > background threshold in the time we've first triggered a background CIL > > > push and the push actually started. From that, it seems fairly likely > > > that we could replenish the CIL to the background threshold once > > > threads are unblocked but before the previous push completes. > > > > That's just fine. The CIL is actually designed to allow that sort of > > overlap between multiple committing background contexts and the > > current aggregating context. 
As long as each individual CIL commit > > size doesn't go over half the log and there's always a complete > > commit in the log (which there should be via head/tail limits), > > the only limit on the number of concurrent committing CIL contexts > > on the fly at once is the amount of log space we have available.... > > > > Right, but if we're blocking transactions on the start of a CIL push and > the start of a CIL push blocks on completion of the previous push, it's > possible to indirectly block transactions on completion of a CIL push. > There is some staggering there compared to blocking a transaction > directly on a CIL push start to finish, however, so perhaps that keeps > the delay down in practice. *nod* The idea is that this all works as a multi-stage pipeline so there is always overlap and work being done, even if there are short stalls in individual work units in the pipeline: reservation → modification → commit → CIL insert → CIL push → iclog write → iclog completion → AIL insert → AIL push → metadata IO completion → AIL removal → log tail updates. Given that CPUs are generally much faster than the IO part of the pipeline, short CPU scheduling delays don't introduce stalls in the IO parts of the pipeline and so largely don't affect throughput. It's when delays become long enough to fill the buffers of the previous stage and stall multiple stages that performance goes down. Ideally, there should be no stalls or pipeline delays after the transaction has a reservation as those are bound by the final stage throughput (log tail updates). What we are doing here is adding more feedback between smaller parts of the pipeline to manage the flow of modifications within the pipeline. In this case, it is between the iclog writes and the CIL insert stages so that we don't buffer the entire pipeline's work on the CIL insert stage when CIL pushes stall on iclog writes due to slowness in iclog completion and (perhaps) AIL insertion stages. 
So, yeah, we can commit and insert in the CIL while a push is occurring, and mostly they will happen at the same time and the CIL is large enough that a typical push will complete at the same time the CIL inserts fill a new queue to be pushed.... > > > That sounds more involved than what I was thinking. My thought is that > > > this throttle is already not predictable or deterministic (i.e. we're > > > essentially waiting on a scheduler event) and so might not require the > > > extra complexity of a new waitqueue. It certainly could be the case that > > > blocking on the entire push is just too long in practice, but since this > > > > I've made that sort of change here and measured the regression. > > Then I discarded it as an unworkable option and looked for other > > solutions. > > Ok. What was the regression that resulted from the longer throttle > delay, out of curiosity? Well, I tried two things. The first was throttling by waiting for the entire CIL push by issuing a flush_work() similar to the xlog_cil_push_now() code. That introduced substantial pipeline stalls at the CIL insert point, and performance went way down. The second was moving the up_write() in xlog_cil_push() down to after the CIL had been formatted and written to the iclogs entirely. This didn't wait on the commit record write, so removed that part of the serialisation from the throttling. i.e. this stalls incoming inserts (commits) on the down_read() of the context lock until the CIL push had processed the CIL. This didn't make much difference, because all the work is in formatting the thousands of CIL logvecs into the iclogs and writing them to disk. Basically, blocking the CIL insert on CIL pushes and iclog IO creates pipeline stalls and prevents us from committing new changes while the CIL is flushing. Hence we reduce concurrency in the transaction commit path, and performance goes down accordingly... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 23+ messages in thread
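The pipeline argument in the final message (short stalls in one stage are absorbed by the buffering of the stage before it, and only stalls long enough to fill that buffer propagate back and stall earlier stages) can be sketched with a toy back-pressure model. The function name and the tick/capacity numbers are assumptions for illustration, not anything from the XFS code:

```python
# Toy back-pressure model: feed one work item per tick into a buffered
# stage whose consumer is stalled for `stall_ticks`; count how many
# producer ticks end up stalled by the full buffer.
def run_pipeline(stall_ticks, buffer_capacity):
    buffered = 0
    producer_stalls = 0
    for _ in range(stall_ticks):
        if buffered < buffer_capacity:
            buffered += 1          # buffer absorbs the item
        else:
            producer_stalls += 1   # back-pressure reaches the producer
    return producer_stalls

# A stall shorter than the buffer is invisible to the producer:
print(run_pipeline(stall_ticks=4, buffer_capacity=8))   # 0
# A long stall fills the buffer and stalls the producer for the rest:
print(run_pipeline(stall_ticks=12, buffer_capacity=8))  # 4
```

This is the shape of the problem the throttle addresses: the CIL insert stage is the buffer, and without a hard limit a long iclog-write stall lets it grow without bound instead of pushing back on committers.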
end of thread, other threads:[~2019-10-08 17:34 UTC | newest] Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-09-30 6:03 [PATCH v2 0/2] xfs: limit CIL push sizes Dave Chinner 2019-09-30 6:03 ` [PATCH 1/2] xfs: Lower CIL flush limit for large logs Dave Chinner 2019-09-30 16:55 ` Brian Foster 2019-09-30 6:03 ` [PATCH 2/2] xfs: Throttle commits on delayed background CIL push Dave Chinner 2019-09-30 17:03 ` Brian Foster 2019-09-30 21:53 ` Dave Chinner 2019-10-01 3:42 ` Dave Chinner 2019-10-01 13:13 ` Brian Foster 2019-10-01 23:14 ` Dave Chinner 2019-10-02 12:41 ` Brian Foster 2019-10-03 1:25 ` Dave Chinner 2019-10-03 14:41 ` Brian Foster 2019-10-04 2:27 ` Dave Chinner 2019-10-04 11:50 ` Brian Foster 2019-10-08 2:51 ` Dave Chinner 2019-10-08 13:22 ` Brian Foster 2019-10-08 17:34 ` Brian Foster 2019-10-01 13:13 ` Brian Foster 2019-10-01 22:31 ` Dave Chinner 2019-10-02 12:40 ` Brian Foster 2019-10-03 0:53 ` Dave Chinner 2019-10-03 14:39 ` Brian Foster 2019-10-08 3:34 ` Dave Chinner