* [PATCH 0/3] xfs: CIL improvements
From: Dave Chinner @ 2021-02-23  5:32 UTC
To: linux-xfs

Hi folks,

The following patches improve the behaviour of the CIL and increase
the processing capacity available for pushing changes into the
journal.

The first patch is an interface change, fixing something I did when
first introducing delayed logging to ease the understanding of the
change the CIL placed on the rest of the code. That was to "pretend"
that the log items still recorded the log sequence numbers that they
were committed to, and that these LSNs could be passed to
xfs_log_force_lsn() to force the log to a certain LSN. The truth is
that the CIL hands out CIL sequence numbers to log items on commit,
not journal log sequence numbers. LSNs are now essentially an
internal log and AIL implementation detail; everything else that
runs transactions and interfaces with log items to force the log is
structured around CIL sequence numbers. Hence the first patch
converts xfs_log_force_lsn() to xfs_log_force_seq() and all the high
level code to refer to sequence numbers rather than LSNs.

The second patch gets rid of severe latency issues that the AIL can
see due to having to force the log to unpin items. While the AIL is
running a log force it does nothing else, so we can run out of log
space before the log force returns and the AIL starts pushing again.
The AIL only needs to push on the CIL to get items unpinned, and it
doesn't need to wait for that push to complete before it continues
trying to push items to disk. The AIL will back off when it reaches
its target, so it doesn't need to wait on log forces to back off
when there are pinned items in the AIL. Hence this patch adds a
mechanism to do an async, non-blocking push on the CIL and changes
the AIL to use it.
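The non-blocking push described above can be illustrated with a toy model. This is a standalone userspace sketch, not kernel code: the names only mirror the real ones, and a simple step function stands in for the push worker, locking, and wakeups.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the CIL push interfaces discussed above. A standalone
 * userspace sketch: none of this is the real locking or wakeup
 * machinery, and the names only mirror the kernel ones.
 */
struct toy_cil {
	uint64_t	pushed_seq;	/* highest sequence the worker completed */
	uint64_t	push_target;	/* highest sequence a push was asked for */
};

/* One step of the background push worker. */
static void toy_cil_bg_push(struct toy_cil *cil)
{
	if (cil->pushed_seq < cil->push_target)
		cil->pushed_seq++;
}

/*
 * Async push: record the target and return immediately. The caller
 * (the AIL, in patch 2) goes straight back to pushing items instead
 * of sleeping here.
 */
static void toy_cil_push_now(struct toy_cil *cil, uint64_t seq)
{
	if (seq > cil->push_target)
		cil->push_target = seq;
}

/* Synchronous force: request the push, then wait for it to complete. */
static void toy_cil_force_seq(struct toy_cil *cil, uint64_t seq)
{
	toy_cil_push_now(cil, seq);
	/* stand-in for sleeping on xc_commit_wait until the push is done */
	while (cil->pushed_seq < seq)
		toy_cil_bg_push(cil);
}
```

The key property is that toy_cil_push_now() returns before the push has made any progress, which is what lets the AIL keep working instead of stalling.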
This results in the AIL backing off on its own short timeouts and
repeatedly retrying to make progress, instead of stalling for
seconds waiting for log forces to complete.

The last patch fixes the pipelining of CIL pushes. The pipelining
isn't working as intended; the pushes are actually serialising,
resulting in the CIL push work being CPU bound and limiting the rate
at which items can be pushed to the journal. It is also creating
excessive push latency: the CIL fills and hits the hard throttle
while waiting for the push work to finish the current push, start on
the new push, and swap in a new CIL context that can be committed
to.

Essentially, the problem is an implementation problem, not a design
flaw. The implementation has a single work attached to the CIL,
meaning we can only have a single push work outstanding at any time.
The workqueue can handle more, but we only have a single work item.
So the fix is to move the work to the CIL context so we can queue
and process multiple works at the same time, thereby actually
allowing the CIL push work to pipeline in the intended manner.

Cheers,

Dave.
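The serialisation described above follows from the workqueue rule that queueing an already-queued work item is a no-op. A toy model (standalone userspace C, not the real kernel workqueue API) shows why moving the work from the CIL itself into each CIL context lets pushes pipeline:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_WORKS	8

/*
 * Standalone sketch of the relevant workqueue behaviour: queueing a
 * work item that is already pending does nothing, which is what makes
 * a single shared work serialise the pushes.
 */
struct toy_work {
	bool		pending;
};

static int toy_queue_work(struct toy_work *work)
{
	if (work->pending)
		return 0;	/* already queued: this push request is lost */
	work->pending = true;
	return 1;
}

/* Old scheme: one work embedded in the CIL, reused for every push. */
static int pushes_queued_single_work(int push_requests)
{
	struct toy_work	cil_work = { false };
	int		queued = 0;

	for (int i = 0; i < push_requests; i++)
		queued += toy_queue_work(&cil_work);
	return queued;
}

/*
 * New scheme: each CIL context carries its own work, so every push
 * queues its own checkpoint and the workqueue can run them
 * concurrently.
 */
static int pushes_queued_per_ctx_work(int push_requests)
{
	struct toy_work	ctx_works[MAX_WORKS] = {{ false }};
	int		queued = 0;

	for (int i = 0; i < push_requests && i < MAX_WORKS; i++)
		queued += toy_queue_work(&ctx_works[i]);
	return queued;
}
```

With a single shared work, every push request after the first is dropped until the worker finishes; with one work per context, each checkpoint gets queued independently.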
* [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN
From: Dave Chinner @ 2021-02-23  5:32 UTC
To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

In doing an investigation into AIL push stalls, I was looking at the
log force code to see if an async CIL push could be done instead.
This led me to xfs_log_force_lsn() and looking at how it works.

xfs_log_force_lsn() is only called from inode synchronisation
contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
value as the LSN to sync the log to. This gets passed to
xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
journal, and then used by xfs_log_force_lsn() to flush the iclogs to
the journal.

The problem with this is that ip->i_itemp->ili_last_lsn does not
store a log sequence number. What it stores is whatever was passed
to it by the ->iop_committing method, which is called by
xfs_log_commit_cil(). The value passed to the iop_committing method
is the CIL context sequence number that the item was committed to.

As it turns out, xlog_cil_force_lsn() converts the sequence to an
actual commit LSN for the related context and returns that to
xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
variable that contained a sequence with an actual LSN and then uses
that to sync the iclogs.

This caused me some confusion for a while, even though I originally
wrote all this code a decade ago. ->iop_committing is only used by a
couple of log item types, and only inode items use the sequence
number it is passed.
Let's clean up the API, CIL structures and inode log item to call it
a sequence number, and make it clear that the high level code is
using CIL sequence numbers and not on-disk LSNs for integrity
synchronisation purposes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c   |  2 +-
 fs/xfs/xfs_dquot_item.c |  2 +-
 fs/xfs/xfs_file.c       | 14 +++++++-------
 fs/xfs/xfs_inode.c      | 10 +++++-----
 fs/xfs/xfs_inode_item.c |  4 ++--
 fs/xfs/xfs_inode_item.h |  2 +-
 fs/xfs/xfs_log.c        | 27 ++++++++++++++-------------
 fs/xfs/xfs_log.h        |  4 +---
 fs/xfs/xfs_log_cil.c    | 22 +++++++++-------------
 fs/xfs/xfs_log_priv.h   | 15 +++++++--------
 fs/xfs/xfs_trans.c      |  6 +++---
 fs/xfs/xfs_trans.h      |  4 ++--
 12 files changed, 53 insertions(+), 59 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 14d1fefcbf4c..7affe1aa16da 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -713,7 +713,7 @@ xfs_buf_item_release(
 STATIC void
 xfs_buf_item_committing(
 	struct xfs_log_item	*lip,
-	xfs_lsn_t		commit_lsn)
+	uint64_t		seq)
 {
 	return xfs_buf_item_release(lip);
 }
diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 8c1fdf37ee8f..74ca8fbbf1f3 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -188,7 +188,7 @@ xfs_qm_dquot_logitem_release(
 STATIC void
 xfs_qm_dquot_logitem_committing(
 	struct xfs_log_item	*lip,
-	xfs_lsn_t		commit_lsn)
+	uint64_t		seq)
 {
 	return xfs_qm_dquot_logitem_release(lip);
 }
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index dd33ef2d0e20..8b07408928a9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -118,8 +118,8 @@ xfs_dir_fsync(
 	return xfs_log_force_inode(ip);
 }
 
-static xfs_lsn_t
-xfs_fsync_lsn(
+static uint64_t
+xfs_fsync_seq(
 	struct xfs_inode	*ip,
 	bool			datasync)
 {
@@ -127,7 +127,7 @@ xfs_fsync_lsn(
 		return 0;
 	if (datasync && !(ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
 		return 0;
-	return ip->i_itemp->ili_last_lsn;
+	return ip->i_itemp->ili_commit_seq;
 }
 
 /*
@@ -150,12 +150,12 @@ xfs_fsync_flush_log(
 	int			*log_flushed)
 {
 	int			error = 0;
-	xfs_lsn_t		lsn;
+	uint64_t		seq;
 
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	lsn = xfs_fsync_lsn(ip, datasync);
-	if (lsn) {
-		error = xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC,
+	seq = xfs_fsync_seq(ip, datasync);
+	if (seq) {
+		error = xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC,
 				log_flushed);
 
 		spin_lock(&ip->i_itemp->ili_lock);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 95e3a5e6e5e2..bc3e27f9c7f2 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2636,7 +2636,7 @@ xfs_iunpin(
 	trace_xfs_inode_unpin_nowait(ip, _RET_IP_);
 
 	/* Give the log a push to start the unpinning I/O */
-	xfs_log_force_lsn(ip->i_mount, ip->i_itemp->ili_last_lsn, 0, NULL);
+	xfs_log_force_seq(ip->i_mount, ip->i_itemp->ili_commit_seq, 0, NULL);
 
 }
 
@@ -3641,16 +3641,16 @@ int
 xfs_log_force_inode(
 	struct xfs_inode	*ip)
 {
-	xfs_lsn_t		lsn = 0;
+	uint64_t		seq = 0;
 
 	xfs_ilock(ip, XFS_ILOCK_SHARED);
 	if (xfs_ipincount(ip))
-		lsn = ip->i_itemp->ili_last_lsn;
+		seq = ip->i_itemp->ili_commit_seq;
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
-	if (!lsn)
+	if (!seq)
 		return 0;
-	return xfs_log_force_lsn(ip->i_mount, lsn, XFS_LOG_SYNC, NULL);
+	return xfs_log_force_seq(ip->i_mount, seq, XFS_LOG_SYNC, NULL);
 }
 
 /*
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6ff91e5bf3cd..7270a80f21dc 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -617,9 +617,9 @@ xfs_inode_item_committed(
 STATIC void
 xfs_inode_item_committing(
 	struct xfs_log_item	*lip,
-	xfs_lsn_t		commit_lsn)
+	uint64_t		seq)
 {
-	INODE_ITEM(lip)->ili_last_lsn = commit_lsn;
+	INODE_ITEM(lip)->ili_commit_seq = seq;
 	return xfs_inode_item_release(lip);
 }
 
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index 4b926e32831c..e60125b88292 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -33,7 +33,7 @@ struct xfs_inode_log_item {
 	unsigned int		ili_fields;	   /* fields to be logged */
 	unsigned int		ili_fsync_fields;  /* logged since last fsync */
 	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
-	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
+	uint64_t		ili_commit_seq;	   /* last transaction commit */
 };
 
 static inline int xfs_inode_clean(struct xfs_inode *ip)
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 08d68a6161ae..84cd9b6c6d1f 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3336,14 +3336,13 @@ xfs_log_force(
 }
 
 static int
-__xfs_log_force_lsn(
-	struct xfs_mount	*mp,
+xlog_force_lsn(
+	struct xlog		*log,
 	xfs_lsn_t		lsn,
 	uint			flags,
 	int			*log_flushed,
 	bool			already_slept)
 {
-	struct xlog		*log = mp->m_log;
 	struct xlog_in_core	*iclog;
 
 	spin_lock(&log->l_icloglock);
@@ -3376,8 +3375,6 @@ __xfs_log_force_lsn(
 		if (!already_slept &&
 		    (iclog->ic_prev->ic_state == XLOG_STATE_WANT_SYNC ||
 		     iclog->ic_prev->ic_state == XLOG_STATE_SYNCING)) {
-			XFS_STATS_INC(mp, xs_log_force_sleep);
-
 			xlog_wait(&iclog->ic_prev->ic_write_wait,
 				  &log->l_icloglock);
 			return -EAGAIN;
@@ -3415,25 +3412,29 @@ __xfs_log_force_lsn(
  * to disk, that thread will wake up all threads waiting on the queue.
  */
 int
-xfs_log_force_lsn(
+xfs_log_force_seq(
 	struct xfs_mount	*mp,
-	xfs_lsn_t		lsn,
+	uint64_t		seq,
 	uint			flags,
 	int			*log_flushed)
 {
+	struct xlog		*log = mp->m_log;
+	xfs_lsn_t		lsn;
 	int			ret;
-	ASSERT(lsn != 0);
+	ASSERT(seq != 0);
 
 	XFS_STATS_INC(mp, xs_log_force);
-	trace_xfs_log_force(mp, lsn, _RET_IP_);
+	trace_xfs_log_force(mp, seq, _RET_IP_);
 
-	lsn = xlog_cil_force_lsn(mp->m_log, lsn);
+	lsn = xlog_cil_force_seq(log, seq);
 	if (lsn == NULLCOMMITLSN)
 		return 0;
 
-	ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, false);
-	if (ret == -EAGAIN)
-		ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, true);
+	ret = xlog_force_lsn(log, lsn, flags, log_flushed, false);
+	if (ret == -EAGAIN) {
+		XFS_STATS_INC(mp, xs_log_force_sleep);
+		ret = xlog_force_lsn(log, lsn, flags, log_flushed, true);
+	}
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 044e02cb8921..33ae53401060 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -106,7 +106,7 @@ struct xfs_item_ops;
 struct xfs_trans;
 
 int	xfs_log_force(struct xfs_mount *mp, uint flags);
-int	xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags,
+int	xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags,
 		int *log_forced);
 int	xfs_log_mount(struct xfs_mount	*mp,
 		struct xfs_buftarg	*log_target,
@@ -132,8 +132,6 @@ bool	xfs_log_writable(struct xfs_mount *mp);
 struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket);
 void	xfs_log_ticket_put(struct xlog_ticket *ticket);
 
-void	xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp,
-		xfs_lsn_t *commit_lsn, bool regrant);
 void	xlog_cil_process_committed(struct list_head *list);
 bool	xfs_log_item_in_current_chkpt(struct xfs_log_item *lip);
 
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 0a00c3c9610c..2fda8c4513b2 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -792,7 +792,7 @@ xlog_cil_push_work(
 	 * that higher sequences will wait for us to write out a commit record
 	 * before they do.
 	 *
-	 * xfs_log_force_lsn requires us to mirror the new sequence into the cil
+	 * xfs_log_force_seq requires us to mirror the new sequence into the cil
 	 * structure atomically with the addition of this sequence to the
 	 * committing list. This also ensures that we can do unlocked checks
 	 * against the current sequence in log forces without risking
@@ -1054,16 +1054,14 @@ xlog_cil_empty(
  * allowed again.
 */
 void
-xfs_log_commit_cil(
-	struct xfs_mount	*mp,
+xlog_cil_commit(
+	struct xlog		*log,
 	struct xfs_trans	*tp,
-	xfs_lsn_t		*commit_lsn,
+	uint64_t		*commit_seq,
 	bool			regrant)
 {
-	struct xlog		*log = mp->m_log;
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_log_item	*lip, *next;
-	xfs_lsn_t		xc_commit_lsn;
 
 	/*
	 * Do all necessary memory allocation before we lock the CIL.
@@ -1077,10 +1075,6 @@ xfs_log_commit_cil(
 
 	xlog_cil_insert_items(log, tp);
 
-	xc_commit_lsn = cil->xc_ctx->sequence;
-	if (commit_lsn)
-		*commit_lsn = xc_commit_lsn;
-
 	if (regrant && !XLOG_FORCED_SHUTDOWN(log))
 		xfs_log_ticket_regrant(log, tp->t_ticket);
 	else
@@ -1103,8 +1097,10 @@ xfs_log_commit_cil(
 	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
 		xfs_trans_del_item(lip);
 		if (lip->li_ops->iop_committing)
-			lip->li_ops->iop_committing(lip, xc_commit_lsn);
+			lip->li_ops->iop_committing(lip, cil->xc_ctx->sequence);
 	}
+	if (commit_seq)
+		*commit_seq = cil->xc_ctx->sequence;
 
 	/* xlog_cil_push_background() releases cil->xc_ctx_lock */
 	xlog_cil_push_background(log);
@@ -1121,9 +1117,9 @@ xfs_log_commit_cil(
  * iclog flush is necessary following this call.
 */
 xfs_lsn_t
-xlog_cil_force_lsn(
+xlog_cil_force_seq(
 	struct xlog	*log,
-	xfs_lsn_t	sequence)
+	uint64_t	sequence)
 {
 	struct xfs_cil		*cil = log->l_cilp;
 	struct xfs_cil_ctx	*ctx;
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 24acdc54e44e..59778cd5ecdd 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -234,7 +234,7 @@ struct xfs_cil;
 
 struct xfs_cil_ctx {
 	struct xfs_cil		*cil;
-	xfs_lsn_t		sequence;	/* chkpt sequence # */
+	uint64_t		sequence;	/* chkpt sequence # */
 	xfs_lsn_t		start_lsn;	/* first LSN of chkpt commit */
 	xfs_lsn_t		commit_lsn;	/* chkpt commit record lsn */
 	struct xlog_ticket	*ticket;	/* chkpt ticket */
@@ -272,10 +272,10 @@ struct xfs_cil {
 	struct xfs_cil_ctx	*xc_ctx;
 
 	spinlock_t		xc_push_lock ____cacheline_aligned_in_smp;
-	xfs_lsn_t		xc_push_seq;
+	uint64_t		xc_push_seq;
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
-	xfs_lsn_t		xc_current_sequence;
+	uint64_t		xc_current_sequence;
 	struct work_struct	xc_push_work;
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
 } ____cacheline_aligned_in_smp;
@@ -552,19 +552,18 @@ int	xlog_cil_init(struct xlog *log);
 void	xlog_cil_init_post_recovery(struct xlog *log);
 void	xlog_cil_destroy(struct xlog *log);
 bool	xlog_cil_empty(struct xlog *log);
+void	xlog_cil_commit(struct xlog *log, struct xfs_trans *tp,
+		uint64_t *commit_seq, bool regrant);
 
 /*
 * CIL force routines
 */
-xfs_lsn_t
-xlog_cil_force_lsn(
-	struct xlog *log,
-	xfs_lsn_t sequence);
+xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence);
 
 static inline void
 xlog_cil_force(struct xlog *log)
 {
-	xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence);
+	xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence);
 }
 
 /*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 44f72c09c203..697703f3be48 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -849,7 +849,7 @@ __xfs_trans_commit(
 	bool			regrant)
 {
 	struct xfs_mount	*mp = tp->t_mountp;
-	xfs_lsn_t		commit_lsn = -1;
+	uint64_t		commit_seq = 0;
 	int			error = 0;
 	int			sync = tp->t_flags & XFS_TRANS_SYNC;
 
@@ -891,7 +891,7 @@ __xfs_trans_commit(
 	xfs_trans_apply_sb_deltas(tp);
 	xfs_trans_apply_dquot_deltas(tp);
 
-	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
+	xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant);
 
 	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
@@ -901,7 +901,7 @@ __xfs_trans_commit(
 	 * log out now and wait for it.
 	 */
 	if (sync) {
-		error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL);
+		error = xfs_log_force_seq(mp, commit_seq, XFS_LOG_SYNC, NULL);
 		XFS_STATS_INC(mp, xs_trans_sync);
 	} else {
 		XFS_STATS_INC(mp, xs_trans_async);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 8b03fbfe9a1b..d223d4f4e429 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -43,7 +43,7 @@ struct xfs_log_item {
 	struct list_head		li_cil;		/* CIL pointers */
 	struct xfs_log_vec		*li_lv;		/* active log vector */
 	struct xfs_log_vec		*li_lv_shadow;	/* standby vector */
-	xfs_lsn_t			li_seq;		/* CIL commit seq */
+	uint64_t			li_seq;		/* CIL commit seq */
 };
 
 /*
@@ -69,7 +69,7 @@ struct xfs_item_ops {
 	void (*iop_pin)(struct xfs_log_item *);
 	void (*iop_unpin)(struct xfs_log_item *, int remove);
 	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
-	void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn);
+	void (*iop_committing)(struct xfs_log_item *lip, uint64_t seq);
 	void (*iop_release)(struct xfs_log_item *);
 	xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t);
 	int (*iop_recover)(struct xfs_log_item *lip,
-- 
2.28.0
* Re: [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN
From: Darrick J. Wong @ 2021-02-24 21:42 UTC
To: Dave Chinner; +Cc: linux-xfs

On Tue, Feb 23, 2021 at 04:32:10PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> In doing an investigation into AIL push stalls, I was looking at the
> log force code to see if an async CIL push could be done instead.
> This led me to xfs_log_force_lsn() and looking at how it works.

[...]

> @@ -713,7 +713,7 @@ xfs_buf_item_release(
>  STATIC void
>  xfs_buf_item_committing(
>  	struct xfs_log_item	*lip,
> -	xfs_lsn_t		commit_lsn)
> +	uint64_t		seq)

FWIW I rather wish you'd defined a new type for CIL sequence numbers,
since uint64_t is rather generic. Even if checkpatch whines about new
typedefs.

I was kind of hoping that we'd be able to mark xfs_lsn_t and
xfs_csn_t with __bitwise so that static checkers could catch us if we
accidentally feed a CIL sequence number into a function that wants an
LSN.
--D
* Re: [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN 2021-02-24 21:42 ` Darrick J. Wong @ 2021-02-24 22:19 ` Dave Chinner 2021-02-25 19:01 ` Christoph Hellwig 1 sibling, 0 replies; 33+ messages in thread From: Dave Chinner @ 2021-02-24 22:19 UTC (permalink / raw) To: Darrick J. Wong; +Cc: linux-xfs On Wed, Feb 24, 2021 at 01:42:35PM -0800, Darrick J. Wong wrote: > On Tue, Feb 23, 2021 at 04:32:10PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > In doing an investigation into AIL push stalls, I was looking at the > > log force code to see if an async CIL push could be done instead. > > This led me to xfs_log_force_lsn() and looking at how it works. > > > > xfs_log_force_lsn() is only called from inode synchronisation > > contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn > > value as the LSN to sync the log to. This gets passed to > > xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the > > journal, and then used by xfs_log_force_lsn() to flush the iclogs to > > the journal. > > > > The problem with this is that ip->i_itemp->ili_last_lsn does not store a > > log sequence number. What it stores is passed to it from the > > ->iop_committing method, which is called by xfs_log_commit_cil(). > > The value this passes to the iop_committing method is the CIL > > context sequence number that the item was committed to. > > > > As it turns out, xlog_cil_force_lsn() converts the sequence to an > > actual commit LSN for the related context and returns that to > > xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn" > > variable that contained a sequence with an actual LSN and then uses > > that to sync the iclogs. > > > > This caused me some confusion for a while, even though I originally > > wrote all this code a decade ago. ->iop_committing is only used by > > a couple of log item types, and only inode items use the sequence > > number it is passed. 
> > > > Let's clean up the API, CIL structures and inode log item to call it > > a sequence number, and make it clear that the high level code is > > using CIL sequence numbers and not on-disk LSNs for integrity > > synchronisation purposes. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_buf_item.c | 2 +- > > fs/xfs/xfs_dquot_item.c | 2 +- > > fs/xfs/xfs_file.c | 14 +++++++------- > > fs/xfs/xfs_inode.c | 10 +++++----- > > fs/xfs/xfs_inode_item.c | 4 ++-- > > fs/xfs/xfs_inode_item.h | 2 +- > > fs/xfs/xfs_log.c | 27 ++++++++++++++------------- > > fs/xfs/xfs_log.h | 4 +--- > > fs/xfs/xfs_log_cil.c | 22 +++++++++------------- > > fs/xfs/xfs_log_priv.h | 15 +++++++-------- > > fs/xfs/xfs_trans.c | 6 +++--- > > fs/xfs/xfs_trans.h | 4 ++-- > > 12 files changed, 53 insertions(+), 59 deletions(-) > > > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c > > index 14d1fefcbf4c..7affe1aa16da 100644 > > --- a/fs/xfs/xfs_buf_item.c > > +++ b/fs/xfs/xfs_buf_item.c > > @@ -713,7 +713,7 @@ xfs_buf_item_release( > > STATIC void > > xfs_buf_item_committing( > > struct xfs_log_item *lip, > > - xfs_lsn_t commit_lsn) > > + uint64_t seq) > > FWIW I rather wish you'd defined a new type for cil sequence numbers, > since uint64_t is rather generic. Even if checkpatch whines about new > typedefs. I don't use checkpatch so I don't care about all the idiotic "crusade of the month" stuff it is always whining about. > I was kind of hoping that we'd be able to mark xfs_lsn_t and xfs_csn_t > with __bitwise and so static checkers could catch us if we accidentally > feed a CIL sequence number into a function that wants an LSN. Seems reasonable to hide that behind a typedef. No idea how to set it up and test it, though, and I don't really have time right now. I'll change it to a typedef to make this easier to do in future, though. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN 2021-02-24 21:42 ` Darrick J. Wong 2021-02-24 22:19 ` Dave Chinner @ 2021-02-25 19:01 ` Christoph Hellwig 1 sibling, 0 replies; 33+ messages in thread From: Christoph Hellwig @ 2021-02-25 19:01 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs On Wed, Feb 24, 2021 at 01:42:35PM -0800, Darrick J. Wong wrote: > FWIW I rather wish you'd defined a new type for cil sequence numbers, > since uint64_t is rather generic. Even if checkpatch whines about new > typedefs. > > I was kind of hoping that we'd be able to mark xfs_lsn_t and xfs_csn_t > with __bitwise and so static checkers could catch us if we accidentally > feed a CIL sequence number into a function that wants an LSN. __bitwise is a rather bad fit for xfs_lsn_t. It is used as scalar and not flags for which __bitwise is used, and it is written to disk in many places. Any in case you're asking: I tried to quickly convert it and noticed that the hard way. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN 2021-02-23 5:32 ` [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner 2021-02-24 21:42 ` Darrick J. Wong @ 2021-03-02 18:12 ` Brian Foster 1 sibling, 0 replies; 33+ messages in thread From: Brian Foster @ 2021-03-02 18:12 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Tue, Feb 23, 2021 at 04:32:10PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > In doing an investigation into AIL push stalls, I was looking at the > log force code to see if an async CIL push could be done instead. > This led me to xfs_log_force_lsn() and looking at how it works. > > xfs_log_force_lsn() is only called from inode synchronisation > contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn > value as the LSN to sync the log to. This gets passed to > xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the > journal, and then used by xfs_log_force_lsn() to flush the iclogs to > the journal. > As you point out just below, this ->ili_last_lsn value is not used by xfs_log_force_lsn(). It's passed to the CIL force and that potentially returns an LSN for the log force call. > The problem with this is that ip->i_itemp->ili_last_lsn does not store a > log sequence number. What it stores is passed to it from the > ->iop_committing method, which is called by xfs_log_commit_cil(). > The value this passes to the iop_committing method is the CIL > context sequence number that the item was committed to. > Ok, but what is the "problem?" It sounds like this patch is cleaning things up vs. fixing a problem... > As it turns out, xlog_cil_force_lsn() converts the sequence to an > actual commit LSN for the related context and returns that to > xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn" > variable that contained a sequence with an actual LSN and then uses > that to sync the iclogs. 
> > This caused me some confusion for a while, even though I originally > wrote all this code a decade ago. ->iop_committing is only used by > a couple of log item types, and only inode items use the sequence > number it is passed. > > Let's clean up the API, CIL structures and inode log item to call it > a sequence number, and make it clear that the high level code is > using CIL sequence numbers and not on-disk LSNs for integrity > synchronisation purposes. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_buf_item.c | 2 +- > fs/xfs/xfs_dquot_item.c | 2 +- > fs/xfs/xfs_file.c | 14 +++++++------- > fs/xfs/xfs_inode.c | 10 +++++----- > fs/xfs/xfs_inode_item.c | 4 ++-- > fs/xfs/xfs_inode_item.h | 2 +- > fs/xfs/xfs_log.c | 27 ++++++++++++++------------- > fs/xfs/xfs_log.h | 4 +--- > fs/xfs/xfs_log_cil.c | 22 +++++++++------------- > fs/xfs/xfs_log_priv.h | 15 +++++++-------- > fs/xfs/xfs_trans.c | 6 +++--- > fs/xfs/xfs_trans.h | 4 ++-- > 12 files changed, 53 insertions(+), 59 deletions(-) > ... > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index 08d68a6161ae..84cd9b6c6d1f 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c ... > @@ -3415,25 +3412,29 @@ __xfs_log_force_lsn( > * to disk, that thread will wake up all threads waiting on the queue. 
> */ > int > -xfs_log_force_lsn( > +xfs_log_force_seq( > struct xfs_mount *mp, > - xfs_lsn_t lsn, > + uint64_t seq, > uint flags, > int *log_flushed) > { > + struct xlog *log = mp->m_log; > + xfs_lsn_t lsn; > int ret; > - ASSERT(lsn != 0); > + ASSERT(seq != 0); > > XFS_STATS_INC(mp, xs_log_force); > - trace_xfs_log_force(mp, lsn, _RET_IP_); > + trace_xfs_log_force(mp, seq, _RET_IP_); > > - lsn = xlog_cil_force_lsn(mp->m_log, lsn); > + lsn = xlog_cil_force_seq(log, seq); > if (lsn == NULLCOMMITLSN) > return 0; > > - ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, false); > - if (ret == -EAGAIN) > - ret = __xfs_log_force_lsn(mp, lsn, flags, log_flushed, true); > + ret = xlog_force_lsn(log, lsn, flags, log_flushed, false); > + if (ret == -EAGAIN) { > + XFS_STATS_INC(mp, xs_log_force_sleep); This looks misplaced to me here. I'd just leave it where we do the sleep and pass log->l_mp. That nit and the slightly misleading commit log aside, the rest LGTM and seems like a nice cleanup: Reviewed-by: Brian Foster <bfoster@redhat.com> > + ret = xlog_force_lsn(log, lsn, flags, log_flushed, true); > + } > return ret; > } > > diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h > index 044e02cb8921..33ae53401060 100644 > --- a/fs/xfs/xfs_log.h > +++ b/fs/xfs/xfs_log.h > @@ -106,7 +106,7 @@ struct xfs_item_ops; > struct xfs_trans; > > int xfs_log_force(struct xfs_mount *mp, uint flags); > -int xfs_log_force_lsn(struct xfs_mount *mp, xfs_lsn_t lsn, uint flags, > +int xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags, > int *log_forced); > int xfs_log_mount(struct xfs_mount *mp, > struct xfs_buftarg *log_target, > @@ -132,8 +132,6 @@ bool xfs_log_writable(struct xfs_mount *mp); > struct xlog_ticket *xfs_log_ticket_get(struct xlog_ticket *ticket); > void xfs_log_ticket_put(struct xlog_ticket *ticket); > > -void xfs_log_commit_cil(struct xfs_mount *mp, struct xfs_trans *tp, > - xfs_lsn_t *commit_lsn, bool regrant); > void xlog_cil_process_committed(struct list_head 
*list); > bool xfs_log_item_in_current_chkpt(struct xfs_log_item *lip); > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index 0a00c3c9610c..2fda8c4513b2 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -792,7 +792,7 @@ xlog_cil_push_work( > * that higher sequences will wait for us to write out a commit record > * before they do. > * > - * xfs_log_force_lsn requires us to mirror the new sequence into the cil > + * xfs_log_force_seq requires us to mirror the new sequence into the cil > * structure atomically with the addition of this sequence to the > * committing list. This also ensures that we can do unlocked checks > * against the current sequence in log forces without risking > @@ -1054,16 +1054,14 @@ xlog_cil_empty( > * allowed again. > */ > void > -xfs_log_commit_cil( > - struct xfs_mount *mp, > +xlog_cil_commit( > + struct xlog *log, > struct xfs_trans *tp, > - xfs_lsn_t *commit_lsn, > + uint64_t *commit_seq, > bool regrant) > { > - struct xlog *log = mp->m_log; > struct xfs_cil *cil = log->l_cilp; > struct xfs_log_item *lip, *next; > - xfs_lsn_t xc_commit_lsn; > > /* > * Do all necessary memory allocation before we lock the CIL. 
> @@ -1077,10 +1075,6 @@ xfs_log_commit_cil( > > xlog_cil_insert_items(log, tp); > > - xc_commit_lsn = cil->xc_ctx->sequence; > - if (commit_lsn) > - *commit_lsn = xc_commit_lsn; > - > if (regrant && !XLOG_FORCED_SHUTDOWN(log)) > xfs_log_ticket_regrant(log, tp->t_ticket); > else > @@ -1103,8 +1097,10 @@ xfs_log_commit_cil( > list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) { > xfs_trans_del_item(lip); > if (lip->li_ops->iop_committing) > - lip->li_ops->iop_committing(lip, xc_commit_lsn); > + lip->li_ops->iop_committing(lip, cil->xc_ctx->sequence); > } > + if (commit_seq) > + *commit_seq = cil->xc_ctx->sequence; > > /* xlog_cil_push_background() releases cil->xc_ctx_lock */ > xlog_cil_push_background(log); > @@ -1121,9 +1117,9 @@ xfs_log_commit_cil( > * iclog flush is necessary following this call. > */ > xfs_lsn_t > -xlog_cil_force_lsn( > +xlog_cil_force_seq( > struct xlog *log, > - xfs_lsn_t sequence) > + uint64_t sequence) > { > struct xfs_cil *cil = log->l_cilp; > struct xfs_cil_ctx *ctx; > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index 24acdc54e44e..59778cd5ecdd 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -234,7 +234,7 @@ struct xfs_cil; > > struct xfs_cil_ctx { > struct xfs_cil *cil; > - xfs_lsn_t sequence; /* chkpt sequence # */ > + uint64_t sequence; /* chkpt sequence # */ > xfs_lsn_t start_lsn; /* first LSN of chkpt commit */ > xfs_lsn_t commit_lsn; /* chkpt commit record lsn */ > struct xlog_ticket *ticket; /* chkpt ticket */ > @@ -272,10 +272,10 @@ struct xfs_cil { > struct xfs_cil_ctx *xc_ctx; > > spinlock_t xc_push_lock ____cacheline_aligned_in_smp; > - xfs_lsn_t xc_push_seq; > + uint64_t xc_push_seq; > struct list_head xc_committing; > wait_queue_head_t xc_commit_wait; > - xfs_lsn_t xc_current_sequence; > + uint64_t xc_current_sequence; > struct work_struct xc_push_work; > wait_queue_head_t xc_push_wait; /* background push throttle */ > } ____cacheline_aligned_in_smp; > @@ -552,19 +552,18 @@ 
int xlog_cil_init(struct xlog *log); > void xlog_cil_init_post_recovery(struct xlog *log); > void xlog_cil_destroy(struct xlog *log); > bool xlog_cil_empty(struct xlog *log); > +void xlog_cil_commit(struct xlog *log, struct xfs_trans *tp, > + uint64_t *commit_seq, bool regrant); > > /* > * CIL force routines > */ > -xfs_lsn_t > -xlog_cil_force_lsn( > - struct xlog *log, > - xfs_lsn_t sequence); > +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence); > > static inline void > xlog_cil_force(struct xlog *log) > { > - xlog_cil_force_lsn(log, log->l_cilp->xc_current_sequence); > + xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence); > } > > /* > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index 44f72c09c203..697703f3be48 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -849,7 +849,7 @@ __xfs_trans_commit( > bool regrant) > { > struct xfs_mount *mp = tp->t_mountp; > - xfs_lsn_t commit_lsn = -1; > + uint64_t commit_seq = 0; > int error = 0; > int sync = tp->t_flags & XFS_TRANS_SYNC; > > @@ -891,7 +891,7 @@ __xfs_trans_commit( > xfs_trans_apply_sb_deltas(tp); > xfs_trans_apply_dquot_deltas(tp); > > - xfs_log_commit_cil(mp, tp, &commit_lsn, regrant); > + xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant); > > current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS); > xfs_trans_free(tp); > @@ -901,7 +901,7 @@ __xfs_trans_commit( > * log out now and wait for it. 
> */ > if (sync) { > - error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL); > + error = xfs_log_force_seq(mp, commit_seq, XFS_LOG_SYNC, NULL); > XFS_STATS_INC(mp, xs_trans_sync); > } else { > XFS_STATS_INC(mp, xs_trans_async); > diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h > index 8b03fbfe9a1b..d223d4f4e429 100644 > --- a/fs/xfs/xfs_trans.h > +++ b/fs/xfs/xfs_trans.h > @@ -43,7 +43,7 @@ struct xfs_log_item { > struct list_head li_cil; /* CIL pointers */ > struct xfs_log_vec *li_lv; /* active log vector */ > struct xfs_log_vec *li_lv_shadow; /* standby vector */ > - xfs_lsn_t li_seq; /* CIL commit seq */ > + uint64_t li_seq; /* CIL commit seq */ > }; > > /* > @@ -69,7 +69,7 @@ struct xfs_item_ops { > void (*iop_pin)(struct xfs_log_item *); > void (*iop_unpin)(struct xfs_log_item *, int remove); > uint (*iop_push)(struct xfs_log_item *, struct list_head *); > - void (*iop_committing)(struct xfs_log_item *, xfs_lsn_t commit_lsn); > + void (*iop_committing)(struct xfs_log_item *lip, uint64_t seq); > void (*iop_release)(struct xfs_log_item *); > xfs_lsn_t (*iop_committed)(struct xfs_log_item *, xfs_lsn_t); > int (*iop_recover)(struct xfs_log_item *lip, > -- > 2.28.0 > ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing 2021-02-23 5:32 [PATCH 0/3] xfs: CIL improvements Dave Chinner 2021-02-23 5:32 ` [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner @ 2021-02-23 5:32 ` Dave Chinner 2021-02-24 21:10 ` Dave Chinner 2021-02-23 5:32 ` [PATCH 3/3] xfs: CIL work is serialised, not pipelined Dave Chinner 2 siblings, 1 reply; 33+ messages in thread From: Dave Chinner @ 2021-02-23 5:32 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> The AIL pushing is stalling on log forces when it comes across pinned items. This is happening on removal workloads where the AIL is dominated by stale items that are removed from the AIL when the checkpoint that marks the items stale is committed to the journal. This results in relatively few items in the AIL, but those that remain are often pinned because the directories that items are being removed from are still being logged. As a result, many push cycles through the CIL will first issue a blocking log force to unpin the items. This can take some time to complete, with tracing regularly showing push delays of half a second and sometimes up into the range of several seconds. Sequences like this aren't uncommon: .... 
399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20 <wanted 20ms, got 270ms delay> 400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0 400.099623: xfsaild: first lsn 0x11002f3600 400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50 <wanted 50ms, got 500ms delay> 400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0 400.589349: xfsaild: first lsn 0x1100305000 400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50 <wanted 50ms, got 460ms delay> 400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0 400.950343: xfsaild: first lsn 0x1100317c00 400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20 <wanted 20ms, got 200ms delay> 401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0 401.142334: xfsaild: first lsn 0x110032e600 401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10 <wanted 10ms, got 10ms delay> 401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000 401.154328: xfsaild: first lsn 0x1100353000 401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20 <wanted 20ms, got 300ms delay> 401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 401.451526: xfsaild: first lsn 0x1100353000 401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50 <wanted 50ms, got 500ms delay> 401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 .... In each of these cases, every AIL pass saw 101 log items stuck on the AIL (pinned) with very few other items being found. Each pass, a log force was issued, and delay between last/first is the sleep time + the sync log force time. Some of these 101 items pinned the tail of the log. 
The tail of the log does slowly creep forward (first lsn), but the problem is that the log is actually out of reservation space because it has been running so many transactions that create stale items which never reach the AIL but still consume log space. Hence we have a largely empty AIL, with long term pins on items that pin the tail of the log that don't get pushed frequently enough to keep log space available. The problem is the hundreds of milliseconds that we block in the log force pushing the CIL out to disk. The AIL should not be stalled like this - it needs to run and flush items that are at the tail of the log with minimal latency. What we really need to do is trigger a log flush, but then not wait for it at all - we've already done our waiting for stuff to complete when we backed off prior to the log force being issued. Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we still do a blocking flush of the CIL and that is what is causing the issue. Hence we need a new interface for the CIL to trigger an immediate background push of the CIL to get it moving faster but not to wait on that to occur. While the CIL is pushing, the AIL can also be pushing. We already have an internal interface to do this - xlog_cil_push_now() - but we need a wrapper for it to be used externally. xlog_cil_force_seq() can easily be extended to do what we need as it already implements the synchronous CIL push via xlog_cil_push_now(). Add the necessary flags and "push current sequence" semantics to xlog_cil_force_seq() and convert the AIL pushing to use it. 
Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_log.c | 33 ++++++++++++++++++++++----------- fs/xfs/xfs_log.h | 1 + fs/xfs/xfs_log_cil.c | 29 +++++++++++++++++++++-------- fs/xfs/xfs_log_priv.h | 5 +++-- fs/xfs/xfs_sysfs.c | 1 + fs/xfs/xfs_trace.c | 1 + fs/xfs/xfs_trans.c | 2 +- fs/xfs/xfs_trans_ail.c | 11 ++++++++--- 8 files changed, 58 insertions(+), 25 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 84cd9b6c6d1f..a0ecaf4f9b77 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -3335,6 +3335,20 @@ xfs_log_force( return -EIO; } +/* + * Force the log to a specific LSN. + * + * If an iclog with that lsn can be found: + * If it is in the DIRTY state, just return. + * If it is in the ACTIVE state, move the in-core log into the WANT_SYNC + * state and go to sleep or return. + * If it is in any other state, go to sleep or return. + * + * Synchronous forces are implemented with a wait queue. All callers trying + * to force a given lsn to disk must wait on the queue attached to the + * specific in-core log. When given in-core log finally completes its write + * to disk, that thread will wake up all threads waiting on the queue. + */ static int xlog_force_lsn( struct xlog *log, @@ -3398,18 +3412,15 @@ xlog_force_lsn( } /* - * Force the in-core log to disk for a specific LSN. + * Force the log to a specific checkpoint sequence. * - * Find in-core log with lsn. - * If it is in the DIRTY state, just return. - * If it is in the ACTIVE state, move the in-core log into the WANT_SYNC - * state and go to sleep or return. - * If it is in any other state, go to sleep or return. + * First force the CIL so that all the required changes have been flushed to the + * iclogs. If the CIL force completed it will return a commit LSN that indicates + * the iclog that needs to be flushed to stable storage. * - * Synchronous forces are implemented with a wait queue. 
All callers trying - * to force a given lsn to disk must wait on the queue attached to the - * specific in-core log. When given in-core log finally completes its write - * to disk, that thread will wake up all threads waiting on the queue. + * If the XFS_LOG_SYNC flag is not set, we only trigger a background CIL force + * and do not wait for it to complete, nor do we attempt to check/flush iclogs + * as the CIL will not have committed when xlog_cil_force_seq() returns. */ int xfs_log_force_seq( @@ -3426,7 +3437,7 @@ xfs_log_force_seq( XFS_STATS_INC(mp, xs_log_force); trace_xfs_log_force(mp, seq, _RET_IP_); - lsn = xlog_cil_force_seq(log, seq); + lsn = xlog_cil_force_seq(log, flags, seq); if (lsn == NULLCOMMITLSN) return 0; diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 33ae53401060..77d3fca28f55 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -104,6 +104,7 @@ struct xlog_ticket; struct xfs_log_item; struct xfs_item_ops; struct xfs_trans; +struct xlog; int xfs_log_force(struct xfs_mount *mp, uint flags); int xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags, diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 2fda8c4513b2..be6c91d7218f 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -999,7 +999,8 @@ xlog_cil_push_background( static void xlog_cil_push_now( struct xlog *log, - xfs_lsn_t push_seq) + xfs_lsn_t push_seq, + bool flush) { struct xfs_cil *cil = log->l_cilp; @@ -1009,7 +1010,8 @@ xlog_cil_push_now( ASSERT(push_seq && push_seq <= cil->xc_current_sequence); /* start on any pending background push to minimise wait time on it */ - flush_work(&cil->xc_push_work); + if (flush) + flush_work(&cil->xc_push_work); /* * If the CIL is empty or we've already pushed the sequence then @@ -1109,16 +1111,22 @@ xlog_cil_commit( /* * Conditionally push the CIL based on the sequence passed in. * - * We only need to push if we haven't already pushed the sequence - * number given. 
Hence the only time we will trigger a push here is - * if the push sequence is the same as the current context. + * We only need to push if we haven't already pushed the sequence number given. + * Hence the only time we will trigger a push here is if the push sequence is + * the same as the current context. * - * We return the current commit lsn to allow the callers to determine if a - * iclog flush is necessary following this call. + * If the sequence is zero, push the current sequence. If XFS_LOG_SYNC is set in + * the flags wait for it to complete, otherwise just return NULLCOMMITLSN to + * indicate we didn't wait for a commit lsn. + * + * If we waited for the push to complete, then we return the current commit lsn + * to allow the callers to determine if an iclog flush is necessary following + * this call. */ xfs_lsn_t xlog_cil_force_seq( struct xlog *log, + uint32_t flags, uint64_t sequence) { struct xfs_cil *cil = log->l_cilp; @@ -1127,13 +1135,18 @@ xlog_cil_force_seq( ASSERT(sequence <= cil->xc_current_sequence); + if (!sequence) + sequence = cil->xc_current_sequence; + /* * check to see if we need to force out the current context. * xlog_cil_push() handles racing pushes for the same sequence, * so no need to deal with it here. */ restart: - xlog_cil_push_now(log, sequence); + xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC); + if (!(flags & XFS_LOG_SYNC)) + return commit_lsn; /* * See if we can find a previous sequence still committing. 
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 59778cd5ecdd..d9929d9146e4 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -558,12 +558,13 @@ void xlog_cil_commit(struct xlog *log, struct xfs_trans *tp, /* * CIL force routines */ -xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence); +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags, + uint64_t sequence); static inline void xlog_cil_force(struct xlog *log) { - xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence); + xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence); } /* diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c index f1bc88f4367c..18dc5eca6c04 100644 --- a/fs/xfs/xfs_sysfs.c +++ b/fs/xfs/xfs_sysfs.c @@ -10,6 +10,7 @@ #include "xfs_log_format.h" #include "xfs_trans_resv.h" #include "xfs_sysfs.h" +#include "xfs_log.h" #include "xfs_log_priv.h" #include "xfs_mount.h" diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c index 9b8d703dc9fd..d111a994b7b6 100644 --- a/fs/xfs/xfs_trace.c +++ b/fs/xfs/xfs_trace.c @@ -20,6 +20,7 @@ #include "xfs_bmap.h" #include "xfs_attr.h" #include "xfs_trans.h" +#include "xfs_log.h" #include "xfs_log_priv.h" #include "xfs_buf_item.h" #include "xfs_quota.h" diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 697703f3be48..7d05694681e3 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -9,7 +9,6 @@ #include "xfs_shared.h" #include "xfs_format.h" #include "xfs_log_format.h" -#include "xfs_log_priv.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_extent_busy.h" @@ -17,6 +16,7 @@ #include "xfs_trans.h" #include "xfs_trans_priv.h" #include "xfs_log.h" +#include "xfs_log_priv.h" #include "xfs_trace.h" #include "xfs_error.h" #include "xfs_defer.h" diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index dbb69b4bf3ed..dfc0206c0d36 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -17,6 +17,7 @@ #include "xfs_errortag.h" #include 
"xfs_error.h" #include "xfs_log.h" +#include "xfs_log_priv.h" #ifdef DEBUG /* @@ -429,8 +430,12 @@ xfsaild_push( /* * If we encountered pinned items or did not finish writing out all - * buffers the last time we ran, force the log first and wait for it - * before pushing again. + * buffers the last time we ran, force a background CIL push to get the + * items unpinned in the near future. We do not wait on the CIL push as + * that could stall us for seconds if there is enough background IO + * load. Stalling for that long when the tail of the log is pinned and + * needs flushing will hard stop the transaction subsystem when log + * space runs out. */ if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 && (!list_empty_careful(&ailp->ail_buf_list) || @@ -438,7 +443,7 @@ xfsaild_push( ailp->ail_log_flush = 0; XFS_STATS_INC(mp, xs_push_ail_flush); - xfs_log_force(mp, XFS_LOG_SYNC); + xlog_cil_force_seq(mp->m_log, 0, 0); } spin_lock(&ailp->ail_lock); -- 2.28.0 ^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing 2021-02-23 5:32 ` [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing Dave Chinner @ 2021-02-24 21:10 ` Dave Chinner 2021-02-24 23:26 ` [PATCH 2/3 v2] " Dave Chinner 2021-02-25 13:12 ` [PATCH 2/3] " Brian Foster 0 siblings, 2 replies; 33+ messages in thread From: Dave Chinner @ 2021-02-24 21:10 UTC (permalink / raw) To: linux-xfs On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > The AIL pushing is stalling on log forces when it comes across > pinned items. This is happening on removal workloads where the AIL > is dominated by stale items that are removed from the AIL when the > checkpoint that marks the items stale is committed to the journal. > This results in relatively few items in the AIL, but those that > remain are often pinned because the directories that items are being > removed from are still being logged. > > As a result, many push cycles through the CIL will first issue a > blocking log force to unpin the items. This can take some time to > complete, with tracing regularly showing push delays of half a > second and sometimes up into the range of several seconds. Sequences > like this aren't uncommon: > > .... 
> 399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20 > <wanted 20ms, got 270ms delay> > 400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0 > 400.099623: xfsaild: first lsn 0x11002f3600 > 400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50 > <wanted 50ms, got 500ms delay> > 400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0 > 400.589349: xfsaild: first lsn 0x1100305000 > 400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50 > <wanted 50ms, got 460ms delay> > 400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0 > 400.950343: xfsaild: first lsn 0x1100317c00 > 400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20 > <wanted 20ms, got 200ms delay> > 401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0 > 401.142334: xfsaild: first lsn 0x110032e600 > 401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10 > <wanted 10ms, got 10ms delay> > 401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000 > 401.154328: xfsaild: first lsn 0x1100353000 > 401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20 > <wanted 20ms, got 300ms delay> > 401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 > 401.451526: xfsaild: first lsn 0x1100353000 > 401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50 > <wanted 50ms, got 500ms delay> > 401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 > .... > > In each of these cases, every AIL pass saw 101 log items stuck on > the AIL (pinned) with very few other items being found. Each pass, a > log force was issued, and delay between last/first is the sleep time > + the sync log force time. > > Some of these 101 items pinned the tail of the log. 
The tail of the > log does slowly creep forward (first lsn), but the problem is that > the log is actually out of reservation space because it's been > running so many transactions that create stale items which never reach > the AIL but still consume log space. Hence we have a largely empty AIL, > with long term pins on items that pin the tail of the log that don't get > pushed frequently enough to keep log space available. > > The problem is the hundreds of milliseconds that we block in the log > force pushing the CIL out to disk. The AIL should not be stalled > like this - it needs to run and flush items that are at the tail of > the log with minimal latency. What we really need to do is trigger a > log flush, but then not wait for it at all - we've already done our > waiting for stuff to complete when we backed off prior to the log > force being issued. > > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we > still do a blocking flush of the CIL and that is what is causing the > issue. Hence we need a new interface for the CIL to trigger an > immediate background push of the CIL to get it moving faster but not > to wait on that to occur. While the CIL is pushing, the AIL can also > be pushing. > > We already have an internal interface to do this - > xlog_cil_push_now() - but we need a wrapper for it to be used > externally. xlog_cil_force_seq() can easily be extended to do what > we need as it already implements the synchronous CIL push via > xlog_cil_push_now(). Add the necessary flags and "push current > sequence" semantics to xlog_cil_force_seq() and convert the AIL > pushing to use it. Overnight testing indicates generic/530 hangs on small logs with this patch. It essentially runs out of log space because there are inode cluster buffers permanently pinned by this workload. 
That is, as inodes are unlinked, they repeatedly relog the inode cluster buffer, and so while the CIL is flushing to unpin the buffer, a new unlink relogs it and adds a new pin to the buffer. Hence the log force that the AIL does to unpin pinned items doesn't actually unpin buffers that are being relogged. These buffers only get unpinned when the log actually runs out of space because they pin the tail of the log. Then the modifications that are relogging the buffer stop because they fail to get a log reservation, and the tail of the log does not move forward until the AIL issues a log force that finally unpins the buffer and writes it back. Essentially, relogged buffers cannot be flushed by the AIL until the log completely stalls. The problem being tripped over here is that we no longer force the final iclogs in a CIL push to disk - we leave the iclog with the commit record in it in ACTIVE state, and by not waiting and forcing all the iclogs to disk, the buffer never gets unpinned because there isn't any more log pressure to force it out because everything is blocked on reservation space. The solution to this is to have the CIL push change the state of the commit iclog to WANT_SYNC before it is released. This means the CIL push will always flush the iclog to disk and the checkpoint will complete and unpin the buffers. Right now, we really only want to do this state switch for these async pushes - for small sync transactions and fsync we really want the iclog aggregation that we have now to optimise iclogbuf usage, so I'll have to pass a new flag through the push code and back into xlog_write(). That will make this patch behave the same as we currently do. In the longer term, we need to change how the AIL deals with pinned buffers, as the current method is definitely sub-optimal. It also explains the "everything stops for a second" performance anomalies I've been seeing for a while in testing. 
But fixing that is outside the scope of this patchset, so in the meantime I'll fix this one up and repost it in a little while. FWIW, this is also the likely cause of the "CIL workqueue too deep" hangs I was seeing with the next patch in the series. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
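The generic/530 fix discussed above - switching an ACTIVE commit iclog to WANT_SYNC before release so that an async CIL push still writes the commit record to disk - can be sketched as a toy state machine. Everything below is a simplified, hypothetical model for illustration: the enum, struct, and function names are stand-ins, not the real definitions from fs/xfs/xfs_log_priv.h.

```c
#include <stdbool.h>

/* Simplified stand-ins for the iclog states involved (hypothetical names). */
enum iclog_state {
	ICLOG_ACTIVE,		/* in memory, may aggregate more records */
	ICLOG_WANT_SYNC,	/* marked for submission to disk */
	ICLOG_SYNCING,		/* write submitted */
};

struct iclog {
	enum iclog_state	state;
	int			refcnt;
};

/*
 * Toy model of releasing the commit iclog at the end of a CIL push.
 *
 * The bug described above: an async push that leaves the commit record
 * in an ACTIVE iclog never submits it, so pinned items stay pinned until
 * something else forces the log - which never happens once everything is
 * blocked on log reservation space. The fix modelled here: switch
 * ACTIVE -> WANT_SYNC before release when the push is asynchronous, so
 * dropping the last reference always submits the write.
 */
static void release_commit_iclog(struct iclog *iclog, bool async_push)
{
	if (async_push && iclog->state == ICLOG_ACTIVE)
		iclog->state = ICLOG_WANT_SYNC;

	if (--iclog->refcnt == 0 && iclog->state == ICLOG_WANT_SYNC)
		iclog->state = ICLOG_SYNCING;	/* submit the write */

	/* an ACTIVE iclog stays in memory, awaiting further aggregation */
}
```

Under this model, a sync push leaves the commit iclog ACTIVE (preserving the iclogbuf aggregation the small-transaction/fsync path relies on), while the async path ends in SYNCING, so the checkpoint completes and unpins its items.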
* [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-02-24 21:10 ` Dave Chinner @ 2021-02-24 23:26 ` Dave Chinner 2021-02-25 2:15 ` Dave Chinner 2021-03-02 21:44 ` Brian Foster 2021-02-25 13:12 ` [PATCH 2/3] " Brian Foster 1 sibling, 2 replies; 33+ messages in thread From: Dave Chinner @ 2021-02-24 23:26 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> The AIL pushing is stalling on log forces when it comes across pinned items. This is happening on removal workloads where the AIL is dominated by stale items that are removed from AIL when the checkpoint that marks the items stale is committed to the journal. This results in relatively few items in the AIL, but those that remain are often pinned, as the directories that items are being removed from are still being logged. As a result, many push cycles through the CIL will first issue a blocking log force to unpin the items. This can take some time to complete, with tracing regularly showing push delays of half a second and sometimes up into the range of several seconds. Sequences like this aren't uncommon: .... 
399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20 <wanted 20ms, got 270ms delay> 400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0 400.099623: xfsaild: first lsn 0x11002f3600 400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50 <wanted 50ms, got 500ms delay> 400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0 400.589349: xfsaild: first lsn 0x1100305000 400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50 <wanted 50ms, got 460ms delay> 400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0 400.950343: xfsaild: first lsn 0x1100317c00 400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20 <wanted 20ms, got 200ms delay> 401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0 401.142334: xfsaild: first lsn 0x110032e600 401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10 <wanted 10ms, got 10ms delay> 401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000 401.154328: xfsaild: first lsn 0x1100353000 401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20 <wanted 20ms, got 300ms delay> 401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 401.451526: xfsaild: first lsn 0x1100353000 401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50 <wanted 50ms, got 500ms delay> 401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 .... In each of these cases, every AIL pass saw 101 log items stuck on the AIL (pinned) with very few other items being found. Each pass, a log force was issued, and delay between last/first is the sleep time + the sync log force time. Some of these 101 items pinned the tail of the log. 
The tail of the log does slowly creep forward (first lsn), but the problem is that the log is actually out of reservation space because it's been running so many transactions that create stale items which never reach the AIL but still consume log space. Hence we have a largely empty AIL, with long term pins on items that pin the tail of the log that don't get pushed frequently enough to keep log space available. The problem is the hundreds of milliseconds that we block in the log force pushing the CIL out to disk. The AIL should not be stalled like this - it needs to run and flush items that are at the tail of the log with minimal latency. What we really need to do is trigger a log flush, but then not wait for it at all - we've already done our waiting for stuff to complete when we backed off prior to the log force being issued. Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we still do a blocking flush of the CIL and that is what is causing the issue. Hence we need a new interface for the CIL to trigger an immediate background push of the CIL to get it moving faster but not to wait on that to occur. While the CIL is pushing, the AIL can also be pushing. We already have an internal interface to do this - xlog_cil_push_now() - but we need a wrapper for it to be used externally. xlog_cil_force_seq() can easily be extended to do what we need as it already implements the synchronous CIL push via xlog_cil_push_now(). Add the necessary flags and "push current sequence" semantics to xlog_cil_force_seq() and convert the AIL pushing to use it. One of the complexities here is that the CIL push does not guarantee that the commit record for the CIL checkpoint is written to disk. The current log force ensures this by submitting the current ACTIVE iclog that the commit record was written to. 
We need the CIL to actually write this commit record to disk for an async push to ensure that the checkpoint actually makes it to disk and unpins the pinned items in the checkpoint on completion. Hence we need to pass down to the CIL push that we are doing an async flush so that it can switch out the commit_iclog if necessary to get written to disk when the commit iclog is finally released. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- Version 2: - ensure the CIL checkpoint issues the commit record to disk for an async push. Fixes generic/530 hang on small logs. - increment log force stats when the CIL is forced and also when it sleeps to give insight into the amount of blocking being done when the CIL is forced. fs/xfs/xfs_log.c | 59 +++++++++++++++++++--------------------------- fs/xfs/xfs_log.h | 2 +- fs/xfs/xfs_log_cil.c | 64 +++++++++++++++++++++++++++++++++++++++++--------- fs/xfs/xfs_log_priv.h | 10 ++++++-- fs/xfs/xfs_sysfs.c | 1 + fs/xfs/xfs_trace.c | 1 + fs/xfs/xfs_trans.c | 2 +- fs/xfs/xfs_trans_ail.c | 11 ++++++--- 8 files changed, 97 insertions(+), 53 deletions(-) diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 84cd9b6c6d1f..51acc0075699 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -50,11 +50,6 @@ xlog_state_get_iclog_space( int *continued_write, int *logoffsetp); STATIC void -xlog_state_switch_iclogs( - struct xlog *log, - struct xlog_in_core *iclog, - int eventual_size); -STATIC void xlog_grant_push_ail( struct xlog *log, int need_bytes); @@ -511,7 +506,7 @@ __xlog_state_release_iclog( * Flush iclog to disk if this is the last reference to the given iclog and the * it is in the WANT_SYNC state. 
*/ -static int +int xlog_state_release_iclog( struct xlog *log, struct xlog_in_core *iclog) @@ -531,23 +526,6 @@ xlog_state_release_iclog( return 0; } -void -xfs_log_release_iclog( - struct xlog_in_core *iclog) -{ - struct xlog *log = iclog->ic_log; - bool sync = false; - - if (atomic_dec_and_lock(&iclog->ic_refcnt, &log->l_icloglock)) { - if (iclog->ic_state != XLOG_STATE_IOERROR) - sync = __xlog_state_release_iclog(log, iclog); - spin_unlock(&log->l_icloglock); - } - - if (sync) - xlog_sync(log, iclog); -} - /* * Mount a log filesystem * @@ -3188,7 +3166,7 @@ xfs_log_ticket_ungrant( * This routine will mark the current iclog in the ring as WANT_SYNC and move * the current iclog pointer to the next iclog in the ring. */ -STATIC void +void xlog_state_switch_iclogs( struct xlog *log, struct xlog_in_core *iclog, @@ -3335,6 +3313,20 @@ xfs_log_force( return -EIO; } +/* + * Force the log to a specific LSN. + * + * If an iclog with that lsn can be found: + * If it is in the DIRTY state, just return. + * If it is in the ACTIVE state, move the in-core log into the WANT_SYNC + * state and go to sleep or return. + * If it is in any other state, go to sleep or return. + * + * Synchronous forces are implemented with a wait queue. All callers trying + * to force a given lsn to disk must wait on the queue attached to the + * specific in-core log. When given in-core log finally completes its write + * to disk, that thread will wake up all threads waiting on the queue. + */ static int xlog_force_lsn( struct xlog *log, @@ -3398,18 +3390,15 @@ xlog_force_lsn( } /* - * Force the in-core log to disk for a specific LSN. + * Force the log to a specific checkpoint sequence. * - * Find in-core log with lsn. - * If it is in the DIRTY state, just return. - * If it is in the ACTIVE state, move the in-core log into the WANT_SYNC - * state and go to sleep or return. - * If it is in any other state, go to sleep or return. 
+ * First force the CIL so that all the required changes have been flushed to the + * iclogs. If the CIL force completed it will return a commit LSN that indicates + * the iclog that needs to be flushed to stable storage. * - * Synchronous forces are implemented with a wait queue. All callers trying - * to force a given lsn to disk must wait on the queue attached to the - * specific in-core log. When given in-core log finally completes its write - * to disk, that thread will wake up all threads waiting on the queue. + * If the XFS_LOG_SYNC flag is not set, we only trigger a background CIL force + * and do not wait for it to complete, nor do we attempt to check/flush iclogs + * as the CIL will not have committed when xlog_cil_force_seq() returns. */ int xfs_log_force_seq( @@ -3426,7 +3415,7 @@ xfs_log_force_seq( XFS_STATS_INC(mp, xs_log_force); trace_xfs_log_force(mp, seq, _RET_IP_); - lsn = xlog_cil_force_seq(log, seq); + lsn = xlog_cil_force_seq(log, flags, seq); if (lsn == NULLCOMMITLSN) return 0; diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 33ae53401060..ff480d372dcf 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -104,6 +104,7 @@ struct xlog_ticket; struct xfs_log_item; struct xfs_item_ops; struct xfs_trans; +struct xlog; int xfs_log_force(struct xfs_mount *mp, uint flags); int xfs_log_force_seq(struct xfs_mount *mp, uint64_t seq, uint flags, @@ -117,7 +118,6 @@ void xfs_log_mount_cancel(struct xfs_mount *); xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp); xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp); void xfs_log_space_wake(struct xfs_mount *mp); -void xfs_log_release_iclog(struct xlog_in_core *iclog); int xfs_log_reserve(struct xfs_mount *mp, int length, int count, diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index 2fda8c4513b2..1b46559ef64a 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -657,6 +657,7 @@ xlog_cil_push_work( xfs_lsn_t commit_lsn; xfs_lsn_t push_seq; 
DECLARE_COMPLETION_ONSTACK(bdev_flush); + bool commit_iclog_sync = false; new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS); new_ctx->ticket = xlog_cil_ticket_alloc(log); @@ -667,7 +668,12 @@ xlog_cil_push_work( spin_lock(&cil->xc_push_lock); push_seq = cil->xc_push_seq; ASSERT(push_seq <= ctx->sequence); + commit_iclog_sync = cil->xc_push_async; + cil->xc_push_async = false; +trace_printk("curseq %llu, ctxseq %llu pushseq %llu empty %d", + cil->xc_current_sequence, cil->xc_ctx->sequence, + cil->xc_push_seq, list_empty(&cil->xc_cil)); /* * As we are about to switch to a new, empty CIL context, we no longer * need to throttle tasks on CIL space overruns. Wake any waiters that @@ -910,7 +916,11 @@ xlog_cil_push_work( commit_iclog->ic_flags &= ~XLOG_ICL_NEED_FLUSH; /* release the hounds! */ - xfs_log_release_iclog(commit_iclog); + spin_lock(&log->l_icloglock); + if (commit_iclog_sync && commit_iclog->ic_state == XLOG_STATE_ACTIVE) + xlog_state_switch_iclogs(log, commit_iclog, 0); + xlog_state_release_iclog(log, commit_iclog); + spin_unlock(&log->l_icloglock); return; out_skip: @@ -993,13 +1003,26 @@ xlog_cil_push_background( /* * xlog_cil_push_now() is used to trigger an immediate CIL push to the sequence * number that is passed. When it returns, the work will be queued for - * @push_seq, but it won't be completed. The caller is expected to do any - * waiting for push_seq to complete if it is required. + * @push_seq, but it won't be completed. + * + * If the caller is performing a synchronous force, we will flush the workqueue + * to get previously queued work moving to minimise the wait time they will + * undergo waiting for all outstanding pushes to complete. The caller is + * expected to do the required waiting for push_seq to complete. + * + * If the caller is performing an async push, we need to ensure that the + * checkpoint is fully flushed out of the iclogs when we finish the push. 
If we + * don't do this, then the commit record may remain sitting in memory in an + * ACTIVE iclog. This then requires another full log force to push to disk, + * which defeats the purpose of having an async, non-blocking CIL force + * mechanism. Hence in this case we need to pass a flag to the push work to + * indicate it needs to flush the commit record itself. */ static void xlog_cil_push_now( struct xlog *log, - xfs_lsn_t push_seq) + xfs_lsn_t push_seq, + bool sync) { struct xfs_cil *cil = log->l_cilp; @@ -1009,7 +1032,8 @@ xlog_cil_push_now( ASSERT(push_seq && push_seq <= cil->xc_current_sequence); /* start on any pending background push to minimise wait time on it */ - flush_work(&cil->xc_push_work); + if (sync) + flush_work(&cil->xc_push_work); /* * If the CIL is empty or we've already pushed the sequence then @@ -1022,6 +1046,8 @@ xlog_cil_push_now( } cil->xc_push_seq = push_seq; + if (!sync) + cil->xc_push_async = true; queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work); spin_unlock(&cil->xc_push_lock); } @@ -1109,16 +1135,22 @@ xlog_cil_commit( /* * Conditionally push the CIL based on the sequence passed in. * - * We only need to push if we haven't already pushed the sequence - * number given. Hence the only time we will trigger a push here is - * if the push sequence is the same as the current context. + * We only need to push if we haven't already pushed the sequence number given. + * Hence the only time we will trigger a push here is if the push sequence is + * the same as the current context. * - * We return the current commit lsn to allow the callers to determine if a - * iclog flush is necessary following this call. + * If the sequence is zero, push the current sequence. If XFS_LOG_SYNC is set in + * the flags wait for it to complete, otherwise just return NULLCOMMITLSN to + * indicate we didn't wait for a commit lsn. 
+ * + * If we waited for the push to complete, then we return the current commit lsn + * to allow the callers to determine if a iclog flush is necessary following + * this call. */ xfs_lsn_t xlog_cil_force_seq( struct xlog *log, + uint32_t flags, uint64_t sequence) { struct xfs_cil *cil = log->l_cilp; @@ -1127,13 +1159,22 @@ xlog_cil_force_seq( ASSERT(sequence <= cil->xc_current_sequence); + if (!sequence) + sequence = cil->xc_current_sequence; +trace_printk("seq %llu, curseq %llu, ctxseq %llu pushseq %llu flags 0x%x", + sequence, cil->xc_current_sequence, cil->xc_ctx->sequence, + cil->xc_push_seq, flags); + + trace_xfs_log_force(log->l_mp, sequence, _RET_IP_); /* * check to see if we need to force out the current context. * xlog_cil_push() handles racing pushes for the same sequence, * so no need to deal with it here. */ restart: - xlog_cil_push_now(log, sequence); + xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC); + if (!(flags & XFS_LOG_SYNC)) + return commit_lsn; /* * See if we can find a previous sequence still committing. @@ -1157,6 +1198,7 @@ xlog_cil_force_seq( * It is still being pushed! Wait for the push to * complete, then start again from the beginning. 
*/ + XFS_STATS_INC(log->l_mp, xs_log_force_sleep); xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock); goto restart; } diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index 7e8ec77bc6e6..6b1a251dd013 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -273,6 +273,7 @@ struct xfs_cil { spinlock_t xc_push_lock ____cacheline_aligned_in_smp; uint64_t xc_push_seq; + bool xc_push_async; struct list_head xc_committing; wait_queue_head_t xc_commit_wait; uint64_t xc_current_sequence; @@ -487,6 +488,10 @@ int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector, struct xlog_in_core **commit_iclog, uint optype); int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket, struct xlog_in_core **iclog, xfs_lsn_t *lsn); +void xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog, + int eventual_size); +int xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog); + void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket); void xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket); @@ -558,12 +563,13 @@ void xlog_cil_commit(struct xlog *log, struct xfs_trans *tp, /* * CIL force routines */ -xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence); +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags, + uint64_t sequence); static inline void xlog_cil_force(struct xlog *log) { - xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence); + xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence); } /* diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c index f1bc88f4367c..18dc5eca6c04 100644 --- a/fs/xfs/xfs_sysfs.c +++ b/fs/xfs/xfs_sysfs.c @@ -10,6 +10,7 @@ #include "xfs_log_format.h" #include "xfs_trans_resv.h" #include "xfs_sysfs.h" +#include "xfs_log.h" #include "xfs_log_priv.h" #include "xfs_mount.h" diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c index 9b8d703dc9fd..d111a994b7b6 100644 --- a/fs/xfs/xfs_trace.c +++ b/fs/xfs/xfs_trace.c 
@@ -20,6 +20,7 @@ #include "xfs_bmap.h" #include "xfs_attr.h" #include "xfs_trans.h" +#include "xfs_log.h" #include "xfs_log_priv.h" #include "xfs_buf_item.h" #include "xfs_quota.h" diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 697703f3be48..7d05694681e3 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -9,7 +9,6 @@ #include "xfs_shared.h" #include "xfs_format.h" #include "xfs_log_format.h" -#include "xfs_log_priv.h" #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_extent_busy.h" @@ -17,6 +16,7 @@ #include "xfs_trans.h" #include "xfs_trans_priv.h" #include "xfs_log.h" +#include "xfs_log_priv.h" #include "xfs_trace.h" #include "xfs_error.h" #include "xfs_defer.h" diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c index dbb69b4bf3ed..dfc0206c0d36 100644 --- a/fs/xfs/xfs_trans_ail.c +++ b/fs/xfs/xfs_trans_ail.c @@ -17,6 +17,7 @@ #include "xfs_errortag.h" #include "xfs_error.h" #include "xfs_log.h" +#include "xfs_log_priv.h" #ifdef DEBUG /* @@ -429,8 +430,12 @@ xfsaild_push( /* * If we encountered pinned items or did not finish writing out all - * buffers the last time we ran, force the log first and wait for it - * before pushing again. + * buffers the last time we ran, force a background CIL push to get the + * items unpinned in the near future. We do not wait on the CIL push as + * that could stall us for seconds if there is enough background IO + * load. Stalling for that long when the tail of the log is pinned and + * needs flushing will hard stop the transaction subsystem when log + * space runs out. */ if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 && (!list_empty_careful(&ailp->ail_buf_list) || @@ -438,7 +443,7 @@ xfsaild_push( ailp->ail_log_flush = 0; XFS_STATS_INC(mp, xs_push_ail_flush); - xfs_log_force(mp, XFS_LOG_SYNC); + xlog_cil_force_seq(mp->m_log, 0, 0); } spin_lock(&ailp->ail_lock); ^ permalink raw reply related [flat|nested] 33+ messages in thread
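The control flow the patch above adds to xlog_cil_force_seq() - "sequence 0 means push the current sequence; without XFS_LOG_SYNC, queue the push and return without waiting" - can be modelled in a few lines. This is a sketch, not kernel code: the struct, the sequence-to-LSN mapping, and the sentinel definition are invented for illustration; only the flag logic mirrors the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define XFS_LOG_SYNC	(1u << 0)	/* sync-force flag, as used by the patch */
#define NULLCOMMITLSN	((uint64_t)-1)	/* "no commit lsn" sentinel (assumed) */

/* Hypothetical, flattened stand-in for the relevant struct xfs_cil fields. */
struct cil_model {
	uint64_t	current_sequence;	/* CIL sequence currently open */
	uint64_t	push_seq;		/* highest sequence queued for push */
	bool		push_async;		/* mirrors xc_push_async */
};

/*
 * Toy model of the post-patch xlog_cil_force_seq():
 *  - sequence == 0 means "push whatever the current sequence is";
 *  - the push work is queued either way;
 *  - without XFS_LOG_SYNC we return NULLCOMMITLSN immediately, so the
 *    caller (e.g. xfsaild) never blocks on the push completing.
 * The real sync path waits on xc_commit_wait and returns the checkpoint's
 * commit lsn; here we just fabricate one from the sequence number.
 */
static uint64_t cil_force_seq(struct cil_model *cil, uint32_t flags,
			      uint64_t sequence)
{
	if (!sequence)
		sequence = cil->current_sequence;

	if (sequence > cil->push_seq) {
		cil->push_seq = sequence;		/* queue the push work */
		cil->push_async = !(flags & XFS_LOG_SYNC);
	}

	if (!(flags & XFS_LOG_SYNC))
		return NULLCOMMITLSN;			/* async: don't wait */

	return sequence << 12;	/* fabricated commit lsn for the sync path */
}
```

The AIL's call site in the patch then reduces to the async form, cil_force_seq(cil, 0, 0): kick the current CIL checkpoint and return immediately, leaving xfsaild free to keep pushing on its own short timeouts.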
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-02-24 23:26 ` [PATCH 2/3 v2] " Dave Chinner @ 2021-02-25 2:15 ` Dave Chinner 2021-03-02 21:44 ` Brian Foster 1 sibling, 0 replies; 33+ messages in thread From: Dave Chinner @ 2021-02-25 2:15 UTC (permalink / raw) To: linux-xfs On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > The AIL pushing is stalling on log forces when it comes across > pinned items. This is happening on removal workloads where the AIL > is dominated by stale items that are removed from AIL when the > checkpoint that marks the items stale is committed to the journal. > This results in relatively few items in the AIL, but those that remain > are often pinned, as the directories that items are being removed from > are still being logged. ..... > One of the complexities here is that the CIL push does not guarantee > that the commit record for the CIL checkpoint is written to disk. > The current log force ensures this by submitting the current ACTIVE > iclog that the commit record was written to. We need the CIL to > actually write this commit record to disk for an async push to > ensure that the checkpoint actually makes it to disk and unpins the > pinned items in the checkpoint on completion. Hence we need to pass > down to the CIL push that we are doing an async flush so that it can > switch out the commit_iclog if necessary to get written to disk when > the commit iclog is finally released. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > Version 2: > - ensure the CIL checkpoint issues the commit record to disk for an > async push. Fixes generic/530 hang on small logs. > - increment log force stats when the CIL is forced and also when it > sleeps to give insight into the amount of blocking being done when > the CIL is forced. Oops, looks like I forgot to strip debug trace_printk()s out of the patch before sending it. 
They are gone now, so I'll wait for review comments before resending again... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-02-24 23:26 ` [PATCH 2/3 v2] " Dave Chinner 2021-02-25 2:15 ` Dave Chinner @ 2021-03-02 21:44 ` Brian Foster 2021-03-03 0:57 ` Dave Chinner 1 sibling, 1 reply; 33+ messages in thread From: Brian Foster @ 2021-03-02 21:44 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > The AIL pushing is stalling on log forces when it comes across > pinned items. This is happening on removal workloads where the AIL > is dominated by stale items that are removed from AIL when the > checkpoint that marks the items stale is committed to the journal. > This results in relatively few items in the AIL, but those that remain > are often pinned, as the directories that items are being removed from > are still being logged. > > As a result, many push cycles through the CIL will first issue a > blocking log force to unpin the items. This can take some time to > complete, with tracing regularly showing push delays of half a > second and sometimes up into the range of several seconds. Sequences > like this aren't uncommon: > ... > > In each of these cases, every AIL pass saw 101 log items stuck on > the AIL (pinned) with very few other items being found. Each pass, a > log force was issued, and delay between last/first is the sleep time > + the sync log force time. > > Some of these 101 items pinned the tail of the log. The tail of the > log does slowly creep forward (first lsn), but the problem is that > the log is actually out of reservation space because it's been > running so many transactions that create stale items which never reach > the AIL but still consume log space. Hence we have a largely empty AIL, with > long term pins on items that pin the tail of the log that don't get > pushed frequently enough to keep log space available. 
> > The problem is the hundreds of milliseconds that we block in the log > force pushing the CIL out to disk. The AIL should not be stalled > like this - it needs to run and flush items that are at the tail of > the log with minimal latency. What we really need to do is trigger a > log flush, but then not wait for it at all - we've already done our > waiting for stuff to complete when we backed off prior to the log > force being issued. > > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we > still do a blocking flush of the CIL and that is what is causing the > issue. Hence we need a new interface for the CIL to trigger an > immediate background push of the CIL to get it moving faster but not > to wait on that to occur. While the CIL is pushing, the AIL can also > be pushing. > So I think I follow what you're describing wrt to the xfsaild delay, but what I'm not clear on is what changing this behavior actually fixes. IOW, if the xfsaild is cycling through and finding mostly stuck items, then we see a delay because of the time spent in a log force (because of those stuck items), what's the advantage to issue an async force over a sync force? Won't xfsaild effectively continue to spin on these stuck items until the log force completes, regardless? Is the issue not so much the throughput of xfsaild, but just the fact that the task stalls are large enough to cause other problems? I.e., the comment you add down in the xfsaild code refers to tail pinning causing reservation stalls. IOW, the purpose of this patch is then to keep xfsaild sleeps to a predictable latency to ensure we come around again quickly enough to keep the tail moving forward (even though the AIL itself might remain in a state where the majority of items are stuck). Yes? > We already have an internal interface to do this - > xlog_cil_push_now() - but we need a wrapper for it to be used > externally. 
xlog_cil_force_seq() can easily be extended to do what > we need as it already implements the synchronous CIL push via > xlog_cil_push_now(). Add the necessary flags and "push current > sequence" semantics to xlog_cil_force_seq() and convert the AIL > pushing to use it. > > One of the complexities here is that the CIL push does not guarantee > that the commit record for the CIL checkpoint is written to disk. > The current log force ensures this by submitting the current ACTIVE > iclog that the commit record was written to. We need the CIL to > actually write this commit record to disk for an async push to > ensure that the checkpoint actually makes it to disk and unpins the > pinned items in the checkpoint on completion. Hence we need to pass > down to the CIL push that we are doing an async flush so that it can > switch out the commit_iclog if necessary to get written to disk when > the commit iclog is finally released. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > Version 2: > - ensure the CIL checkpoint issues the commit record to disk for an > async push. Fixes generic/530 hang on small logs. > - increment log force stats when the CIL is forced and also when it > sleeps to give insight into the amount of blocking being done when > the CIL is forced. > > fs/xfs/xfs_log.c | 59 +++++++++++++++++++--------------------------- > fs/xfs/xfs_log.h | 2 +- > fs/xfs/xfs_log_cil.c | 64 +++++++++++++++++++++++++++++++++++++++++--------- > fs/xfs/xfs_log_priv.h | 10 ++++++-- > fs/xfs/xfs_sysfs.c | 1 + > fs/xfs/xfs_trace.c | 1 + > fs/xfs/xfs_trans.c | 2 +- > fs/xfs/xfs_trans_ail.c | 11 ++++++--- > 8 files changed, 97 insertions(+), 53 deletions(-) > ... > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index 2fda8c4513b2..1b46559ef64a 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c ... 
> @@ -1127,13 +1159,22 @@ xlog_cil_force_seq( > > ASSERT(sequence <= cil->xc_current_sequence); > > + if (!sequence) > + sequence = cil->xc_current_sequence; > +trace_printk("seq %llu, curseq %llu, ctxseq %llu pushseq %llu flags 0x%x", > + sequence, cil->xc_current_sequence, cil->xc_ctx->sequence, > + cil->xc_push_seq, flags); > + > + trace_xfs_log_force(log->l_mp, sequence, _RET_IP_); > /* > * check to see if we need to force out the current context. > * xlog_cil_push() handles racing pushes for the same sequence, > * so no need to deal with it here. > */ > restart: > - xlog_cil_push_now(log, sequence); > + xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC); > + if (!(flags & XFS_LOG_SYNC)) > + return commit_lsn; Hm, so now we have sync and async log force and sync and async CIL push. A log force always requires a sync CIL push and commit record submit; the difference is simply whether or not we wait on commit record I/O completion. The sync CIL push drains the CIL of log items but does not switch out the commit record iclog, while the async CIL push executes in the workqueue context so the drain is async, but it does switch out the commit record iclog before it completes. IOW, the async CIL push is basically a "more async" async log force. I can see the need for the behavior of the async CIL push here, but that leaves a mess of interfaces and behavior matrix. Is there any reason we couldn't just make async log forces unconditionally behave equivalent to the async CIL push as defined by this patch? There's only a handful of existing users and I don't see any obvious reason why they might care whether the underlying CIL push is synchronous or not... Brian > > /* > * See if we can find a previous sequence still committing. > @@ -1157,6 +1198,7 @@ xlog_cil_force_seq( > * It is still being pushed! Wait for the push to > * complete, then start again from the beginning. 
> */ > + XFS_STATS_INC(log->l_mp, xs_log_force_sleep); > xlog_wait(&cil->xc_commit_wait, &cil->xc_push_lock); > goto restart; > } > diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h > index 7e8ec77bc6e6..6b1a251dd013 100644 > --- a/fs/xfs/xfs_log_priv.h > +++ b/fs/xfs/xfs_log_priv.h > @@ -273,6 +273,7 @@ struct xfs_cil { > > spinlock_t xc_push_lock ____cacheline_aligned_in_smp; > uint64_t xc_push_seq; > + bool xc_push_async; > struct list_head xc_committing; > wait_queue_head_t xc_commit_wait; > uint64_t xc_current_sequence; > @@ -487,6 +488,10 @@ int xlog_write(struct xlog *log, struct xfs_log_vec *log_vector, > struct xlog_in_core **commit_iclog, uint optype); > int xlog_commit_record(struct xlog *log, struct xlog_ticket *ticket, > struct xlog_in_core **iclog, xfs_lsn_t *lsn); > +void xlog_state_switch_iclogs(struct xlog *log, struct xlog_in_core *iclog, > + int eventual_size); > +int xlog_state_release_iclog(struct xlog *xlog, struct xlog_in_core *iclog); > + > void xfs_log_ticket_ungrant(struct xlog *log, struct xlog_ticket *ticket); > void xfs_log_ticket_regrant(struct xlog *log, struct xlog_ticket *ticket); > > @@ -558,12 +563,13 @@ void xlog_cil_commit(struct xlog *log, struct xfs_trans *tp, > /* > * CIL force routines > */ > -xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint64_t sequence); > +xfs_lsn_t xlog_cil_force_seq(struct xlog *log, uint32_t flags, > + uint64_t sequence); > > static inline void > xlog_cil_force(struct xlog *log) > { > - xlog_cil_force_seq(log, log->l_cilp->xc_current_sequence); > + xlog_cil_force_seq(log, XFS_LOG_SYNC, log->l_cilp->xc_current_sequence); > } > > /* > diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c > index f1bc88f4367c..18dc5eca6c04 100644 > --- a/fs/xfs/xfs_sysfs.c > +++ b/fs/xfs/xfs_sysfs.c > @@ -10,6 +10,7 @@ > #include "xfs_log_format.h" > #include "xfs_trans_resv.h" > #include "xfs_sysfs.h" > +#include "xfs_log.h" > #include "xfs_log_priv.h" > #include "xfs_mount.h" > > diff --git 
a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c > index 9b8d703dc9fd..d111a994b7b6 100644 > --- a/fs/xfs/xfs_trace.c > +++ b/fs/xfs/xfs_trace.c > @@ -20,6 +20,7 @@ > #include "xfs_bmap.h" > #include "xfs_attr.h" > #include "xfs_trans.h" > +#include "xfs_log.h" > #include "xfs_log_priv.h" > #include "xfs_buf_item.h" > #include "xfs_quota.h" > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c > index 697703f3be48..7d05694681e3 100644 > --- a/fs/xfs/xfs_trans.c > +++ b/fs/xfs/xfs_trans.c > @@ -9,7 +9,6 @@ > #include "xfs_shared.h" > #include "xfs_format.h" > #include "xfs_log_format.h" > -#include "xfs_log_priv.h" > #include "xfs_trans_resv.h" > #include "xfs_mount.h" > #include "xfs_extent_busy.h" > @@ -17,6 +16,7 @@ > #include "xfs_trans.h" > #include "xfs_trans_priv.h" > #include "xfs_log.h" > +#include "xfs_log_priv.h" > #include "xfs_trace.h" > #include "xfs_error.h" > #include "xfs_defer.h" > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c > index dbb69b4bf3ed..dfc0206c0d36 100644 > --- a/fs/xfs/xfs_trans_ail.c > +++ b/fs/xfs/xfs_trans_ail.c > @@ -17,6 +17,7 @@ > #include "xfs_errortag.h" > #include "xfs_error.h" > #include "xfs_log.h" > +#include "xfs_log_priv.h" > > #ifdef DEBUG > /* > @@ -429,8 +430,12 @@ xfsaild_push( > > /* > * If we encountered pinned items or did not finish writing out all > - * buffers the last time we ran, force the log first and wait for it > - * before pushing again. > + * buffers the last time we ran, force a background CIL push to get the > + * items unpinned in the near future. We do not wait on the CIL push as > + * that could stall us for seconds if there is enough background IO > + * load. Stalling for that long when the tail of the log is pinned and > + * needs flushing will hard stop the transaction subsystem when log > + * space runs out. 
> */ > if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 && > (!list_empty_careful(&ailp->ail_buf_list) || > @@ -438,7 +443,7 @@ xfsaild_push( > ailp->ail_log_flush = 0; > > XFS_STATS_INC(mp, xs_push_ail_flush); > - xfs_log_force(mp, XFS_LOG_SYNC); > + xlog_cil_force_seq(mp->m_log, 0, 0); > } > > spin_lock(&ailp->ail_lock); > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-03-02 21:44 ` Brian Foster @ 2021-03-03 0:57 ` Dave Chinner 2021-03-03 17:32 ` Brian Foster 0 siblings, 1 reply; 33+ messages in thread From: Dave Chinner @ 2021-03-03 0:57 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote: > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > The AIL pushing is stalling on log forces when it comes across > > pinned items. This is happening on removal workloads where the AIL > > is dominated by stale items that are removed from the AIL when the > > checkpoint that marks the items stale is committed to the journal. > > This results in relatively few items in the AIL, but those that > > remain are often pinned, as the directories that items are being > > removed from are still being logged. > > > > As a result, many push cycles through the CIL will first issue a > > blocking log force to unpin the items. This can take some time to > > complete, with tracing regularly showing push delays of half a > > second and sometimes up into the range of several seconds. Sequences > > like this aren't uncommon: > > > ... > > > > In each of these cases, every AIL pass saw 101 log items stuck on > > the AIL (pinned) with very few other items being found. Each pass, a > > log force was issued, and the delay between last/first is the sleep > > time + the sync log force time. > > > > Some of these 101 items pinned the tail of the log. The tail of the > > log does slowly creep forward (first lsn), but the problem is that > > the log is actually out of reservation space because it's been > > running so many transactions whose stale items never reach the > > AIL but still consume log space. Hence we have a largely empty AIL, > > with long-term pins on items that pin the tail of the log that don't > > get pushed frequently enough to keep log space available. 
> > > > The problem is the hundreds of milliseconds that we block in the log > > force pushing the CIL out to disk. The AIL should not be stalled > > like this - it needs to run and flush items that are at the tail of > > the log with minimal latency. What we really need to do is trigger a > > log flush, but then not wait for it at all - we've already done our > > waiting for stuff to complete when we backed off prior to the log > > force being issued. > > > > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we > > still do a blocking flush of the CIL and that is what is causing the > > issue. Hence we need a new interface for the CIL to trigger an > > immediate background push of the CIL to get it moving faster but not > > to wait on that to occur. While the CIL is pushing, the AIL can also > > be pushing. > > > > So I think I follow what you're describing wrt to the xfsaild delay, but > what I'm not clear on is what changing this behavior actually fixes. > IOW, if the xfsaild is cycling through and finding mostly stuck items, > then we see a delay because of the time spent in a log force (because of > those stuck items), what's the advantage to issue an async force over a > sync force? Won't xfsaild effectively continue to spin on these stuck > items until the log force completes, regardless? When buffers get flushed and then pinned and can't be written back, we issue sync log forces from the AIL because the delwri list is not empty and the force count is non-zero. So it's not just pinned items in the AIL that trigger this - it's stuff that gets pinned after it has been flushed but before it gets written that causes problems, too. And if it's pinned buffers in the delwri list that are the issue, then we stop flushing other items from the list while we wait for the log force. 
Hence if the tail-pinning items are not actually pinned, but other buffers are, then we block on the log force when we could have been flushing and writing the items that pin the tail of the log. > Is the issue not so much the throughput of xfsaild, but just the fact > that the task stalls are large enough to cause other problems? I.e., the > comment you add down in the xfsaild code refers to tail pinning causing > reservation stalls. IOW, the purpose of this patch is then to keep > xfsaild sleeps to a predictable latency to ensure we come around again > quickly enough to keep the tail moving forward (even though the AIL > itself might remain in a state where the majority of items are stuck). > Yes? Yup, largely because there are multiple triggers for log forces. There's a difference between buffers that can't be flushed because they are pinned and flushed buffers that can't be written because they are pinned. The former prevents the AIL from making push progress, the latter doesn't. > > +trace_printk("seq %llu, curseq %llu, ctxseq %llu pushseq %llu flags 0x%x", > > + sequence, cil->xc_current_sequence, cil->xc_ctx->sequence, > > + cil->xc_push_seq, flags); > > + > > + trace_xfs_log_force(log->l_mp, sequence, _RET_IP_); > > /* > > * check to see if we need to force out the current context. > > * xlog_cil_push() handles racing pushes for the same sequence, > > * so no need to deal with it here. > > */ > > restart: > > - xlog_cil_push_now(log, sequence); > > + xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC); > > + if (!(flags & XFS_LOG_SYNC)) > > + return commit_lsn; > > > > Hm, so now we have sync and async log force and sync and async CIL push. 
The sync CIL push drains the CIL of log items but does not > switch out the commit record iclog, while the async CIL push executes in > the workqueue context so the drain is async, but it does switch out the > commit record iclog before it completes. IOW, the async CIL push is > basically a "more async" async log force. Yes. > I can see the need for the behavior of the async CIL push here, but that > leaves a mess of interfaces and behavior matrix. Is there any reason we > couldn't just make async log forces unconditionally behave equivalent to > the async CIL push as defined by this patch? There's only a handful of > existing users and I don't see any obvious reason why they might care > whether the underlying CIL push is synchronous or not... I'm not touching the rest of the log force code in this series - it is out of scope of this bug fix and the rest of the series that it is part of. Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and changing things like log force behaviour and implementation is for a future series. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-03-03 0:57 ` Dave Chinner @ 2021-03-03 17:32 ` Brian Foster 2021-03-04 1:59 ` Dave Chinner 0 siblings, 1 reply; 33+ messages in thread From: Brian Foster @ 2021-03-03 17:32 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote: > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote: > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > The AIL pushing is stalling on log forces when it comes across > > > pinned items. This is happening on removal workloads where the AIL > > > is dominated by stale items that are removed from AIL when the > > > checkpoint that marks the items stale is committed to the journal. > > > This results is relatively few items in the AIL, but those that are > > > are often pinned as directories items are being removed from are > > > still being logged. > > > > > > As a result, many push cycles through the CIL will first issue a > > > blocking log force to unpin the items. This can take some time to > > > complete, with tracing regularly showing push delays of half a > > > second and sometimes up into the range of several seconds. Sequences > > > like this aren't uncommon: > > > > > ... > > > > > > In each of these cases, every AIL pass saw 101 log items stuck on > > > the AIL (pinned) with very few other items being found. Each pass, a > > > log force was issued, and delay between last/first is the sleep time > > > + the sync log force time. > > > > > > Some of these 101 items pinned the tail of the log. The tail of the > > > log does slowly creep forward (first lsn), but the problem is that > > > the log is actually out of reservation space because it's been > > > running so many transactions that stale items that never reach the > > > AIL but consume log space. 
Hence we have a largely empty AIL, with > > > long term pins on items that pin the tail of the log that don't get > > > pushed frequently enough to keep log space available. > > > > > > The problem is the hundreds of milliseconds that we block in the log > > > force pushing the CIL out to disk. The AIL should not be stalled > > > like this - it needs to run and flush items that are at the tail of > > > the log with minimal latency. What we really need to do is trigger a > > > log flush, but then not wait for it at all - we've already done our > > > waiting for stuff to complete when we backed off prior to the log > > > force being issued. > > > > > > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we > > > still do a blocking flush of the CIL and that is what is causing the > > > issue. Hence we need a new interface for the CIL to trigger an > > > immediate background push of the CIL to get it moving faster but not > > > to wait on that to occur. While the CIL is pushing, the AIL can also > > > be pushing. > > > > > > > So I think I follow what you're describing wrt to the xfsaild delay, but > > what I'm not clear on is what changing this behavior actually fixes. > > IOW, if the xfsaild is cycling through and finding mostly stuck items, > > then we see a delay because of the time spent in a log force (because of > > those stuck items), what's the advantage to issue an async force over a > > sync force? Won't xfsaild effectively continue to spin on these stuck > > items until the log force completes, regardless? > > When buffers get flushed and then pinned and can't be written back, > we issue sync log forces from the AIL because the delwri list is not > empty and the force count is non-zero. > > So it's not just pinned items in the AIL that trigger this - it's > stuff that gets pinned after it has been flushed but before it gets > written that causes problems, too. 
> > And if it's pinned buffers in the delwri list that are the issue, > then we stop flushing other items from the list while we wait for > the log force. hence if the tail pinning items are not actually > pinned, but other buffers are, then we block on the log force when > we could have been flushing and writing the items that pin the tail > of the log. > > > Is the issue not so much the throughput of xfsaild, but just the fact > > that the task stalls are large enough to cause other problems? I.e., the > > comment you add down in the xfsaild code refers to tail pinning causing > > reservation stalls. IOW, the purpose of this patch is then to keep > > xfsaild sleeps to a predictable latency to ensure we come around again > > quickly enough to keep the tail moving forward (even though the AIL > > itself might remain in a state where the majority of items are stuck). > > Yes? > > Yup, largely because there are multiple triggers for log forces. > There's a difference between buffers that can't be flushed because > they are pinned and flushed buffers that can't be written because > they are pinned. The prior prevents the AIL from making push progress, > the latter doesn't. > > > > +trace_printk("seq %llu, curseq %llu, ctxseq %llu pushseq %llu flags 0x%x", > > > + sequence, cil->xc_current_sequence, cil->xc_ctx->sequence, > > > + cil->xc_push_seq, flags); > > > + > > > + trace_xfs_log_force(log->l_mp, sequence, _RET_IP_); > > > /* > > > * check to see if we need to force out the current context. > > > * xlog_cil_push() handles racing pushes for the same sequence, > > > * so no need to deal with it here. > > > */ > > > restart: > > > - xlog_cil_push_now(log, sequence); > > > + xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC); > > > + if (!(flags & XFS_LOG_SYNC)) > > > + return commit_lsn; > > > > Hm, so now we have sync and async log force and sync and async CIL push. 
> > A log force always requires a sync CIL push and commit record submit; > > the difference is simply whether or not we wait on commit record I/O > > completion. The sync CIL push drains the CIL of log items but does not > > switch out the commit record iclog, while the async CIL push executes in > > the workqueue context so the drain is async, but it does switch out the > > commit record iclog before it completes. IOW, the async CIL push is > > basically a "more async" async log force. > > Yes. > > > I can see the need for the behavior of the async CIL push here, but that > > leaves a mess of interfaces and behavior matrix. Is there any reason we > > couldn't just make async log forces unconditionally behave equivalent to > > the async CIL push as defined by this patch? There's only a handful of > > existing users and I don't see any obvious reason why they might care > > whether the underlying CIL push is synchronous or not... > > I'm not touching the rest of the log force code in this series - it > is out of scope of this bug fix and the rest of the series that it > is part of. > But you already have altered the log force code by changing xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq() behaves. So not only does this patch expose subsystem internals to external layers and create more log forcing interfaces/behaviors to maintain, it invokes one or the other from the preexisting async log force variants purely based on whether the caller had a target sequence number or not. This doesn't make a whole lot of sense to me. > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and > changing things like log force behaviour and implementation is for > a future series. > TBH I think this patch is kind of a mess on its own. I think the mechanism it wants to provide is sane, but I've not even got to the point of reviewing _that_ yet because of the seeming dismissal of higher level feedback. 
I'd rather not go around in circles on this so I'll just offer my summarized feedback to this patch... I think this patch should be broken down into two, possibly three sub-patches that generally do the following (not necessarily in this order): 1. Implement the new underlying async CIL push mechanism (i.e. the ->xc_push_async bits and new commit iclog switching behavior). 2. Change the behavior of existing async log forces to use the new mechanism consistently. Document the change in behavior for historical reference. 3. Switch xfsaild to use an async log force instead of the current sync log force. Document why this might depend on the updated async behavior. Further log subsystem cleanup or consolidation that this might facilitate between the historical async log force and the newer CIL-based async log force can occur separately in future patches. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-03-03 17:32 ` Brian Foster @ 2021-03-04 1:59 ` Dave Chinner 2021-03-04 13:13 ` Brian Foster 0 siblings, 1 reply; 33+ messages in thread From: Dave Chinner @ 2021-03-04 1:59 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Wed, Mar 03, 2021 at 12:32:39PM -0500, Brian Foster wrote: > On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote: > > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote: > > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote: > > > > * xlog_cil_push() handles racing pushes for the same sequence, > > > > * so no need to deal with it here. > > > > */ > > > > restart: > > > > - xlog_cil_push_now(log, sequence); > > > > + xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC); > > > > + if (!(flags & XFS_LOG_SYNC)) > > > > + return commit_lsn; > > > > > > Hm, so now we have sync and async log force and sync and async CIL push. > > > A log force always requires a sync CIL push and commit record submit; > > > the difference is simply whether or not we wait on commit record I/O > > > completion. The sync CIL push drains the CIL of log items but does not > > > switch out the commit record iclog, while the async CIL push executes in > > > the workqueue context so the drain is async, but it does switch out the > > > commit record iclog before it completes. IOW, the async CIL push is > > > basically a "more async" async log force. > > > > Yes. > > > > > I can see the need for the behavior of the async CIL push here, but that > > > leaves a mess of interfaces and behavior matrix. Is there any reason we > > > couldn't just make async log forces unconditionally behave equivalent to > > > the async CIL push as defined by this patch? There's only a handful of > > > existing users and I don't see any obvious reason why they might care > > > whether the underlying CIL push is synchronous or not... 
> > > > I'm not touching the rest of the log force code in this series - it > > is out of scope of this bug fix and the rest of the series that it > > is part of. > > > > But you already have altered the log force code by changing > xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq() > behaves. The behaviour of the function when called from xfs_log_force*() should be unchanged. Can you be specific about exactly which behaviour I changed for non-synchronous xfs_log_force*() callers so I can fix it? I have not intended to change it at all, so whatever you are referring to is an issue I need to fix... > So not only does this patch expose subsystem internals to > external layers and create more log forcing interfaces/behaviors to Sorry, I don't follow. What "subsystem internals" are being exposed and what external layer are they being exposed to? > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and > > changing things like log force behaviour and implementation is for > > a future series. > > > > TBH I think this patch is kind of a mess on its own. I think the > mechanism it wants to provide is sane, but I've not even got to the > point of reviewing _that_ yet because of the seeming dismissal of higher > level feedback. I'd rather not go around in circles on this so I'll just > offer my summarized feedback to this patch... I'm not dismissing review nor am I saying the API cannot or should not be improved. I'm simply telling you that I think the additional changes you are proposing are outside the scope of the problem I am addressing here. I already plan to rework the log force API (and others) but doing so is not something that this patchset needs to address, or indeed should address. There are already enough subtle changes being made to core code and algorithms that mixing them with unrelated high-level external behavioural changes greatly increases the risk of unexpected regressions in the patchset. 
Log forces are used in data integrity paths, so I want to limit the scope of behavioural change to just the AIL, where the log force has no data integrity requirement associated with it. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-03-04 1:59 ` Dave Chinner @ 2021-03-04 13:13 ` Brian Foster 2021-03-04 22:48 ` Dave Chinner 0 siblings, 1 reply; 33+ messages in thread From: Brian Foster @ 2021-03-04 13:13 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Thu, Mar 04, 2021 at 12:59:33PM +1100, Dave Chinner wrote: > On Wed, Mar 03, 2021 at 12:32:39PM -0500, Brian Foster wrote: > > On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote: > > > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote: > > > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote: > > > > > * xlog_cil_push() handles racing pushes for the same sequence, > > > > > * so no need to deal with it here. > > > > > */ > > > > > restart: > > > > > - xlog_cil_push_now(log, sequence); > > > > > + xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC); > > > > > + if (!(flags & XFS_LOG_SYNC)) > > > > > + return commit_lsn; > > > > > > > > Hm, so now we have sync and async log force and sync and async CIL push. > > > > A log force always requires a sync CIL push and commit record submit; > > > > the difference is simply whether or not we wait on commit record I/O > > > > completion. The sync CIL push drains the CIL of log items but does not > > > > switch out the commit record iclog, while the async CIL push executes in > > > > the workqueue context so the drain is async, but it does switch out the > > > > commit record iclog before it completes. IOW, the async CIL push is > > > > basically a "more async" async log force. > > > > > > Yes. > > > > > > > I can see the need for the behavior of the async CIL push here, but that > > > > leaves a mess of interfaces and behavior matrix. Is there any reason we > > > > couldn't just make async log forces unconditionally behave equivalent to > > > > the async CIL push as defined by this patch? 
There's only a handful of > > > > existing users and I don't see any obvious reason why they might care > > > > whether the underlying CIL push is synchronous or not... > > > > > > I'm not touching the rest of the log force code in this series - it > > > is out of scope of this bug fix and the rest of the series that it > > > is part of. > > > > > > > But you already have altered the log force code by changing > > xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq() > > behaves. > > The behaviour of the function when called from xfs_log_force*() > should be unchanged. Can you be specific about exactly what > behaviour did I change for non-synchronous xfs_log_force*() callers > so I can fix it? I have not intended to change it at all, so > whatever you are refering is an issue I need to fix... > xfs_log_force_seq() passes flags from the caller to xlog_cil_force_seq() (whereas this patch presumably wants it to pass XFS_LOG_SYNC unconditionally). IOW, xfs_log_force(mp, 0) behavior is different from xfs_log_force_seq(mp, seq, 0, ...). > > So not only does this patch expose subsystem internals to > > external layers and create more log forcing interfaces/behaviors to > > Sorry, I don't follow. What "subsystem internals" are being exposed > and what external layer are they being exposed to? > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and > > > changing things like log force behaviour and implementation is for > > > a future series. > > > > > > > TBH I think this patch is kind of a mess on its own. I think the > > mechanism it wants to provide is sane, but I've not even got to the > > point of reviewing _that_ yet because of the seeming dismissal of higher > > level feedback. I'd rather not go around in circles on this so I'll just > > offer my summarized feedback to this patch... > > I'm not dismissing review nor am I saying the API cannot or should > not be improved. 
I'm simply telling you that I think the additional > changes you are proposing are outside the scope of the problem I am > addressing here. I already plan to rework the log force API (and > others) but doing so it not something that this patchset needs to > address, or indeed should address. > I'm not proposing additional changes nor to rework the log force API. I'm pointing out that I find this implementation to be extremely and unnecessarily confusing. To improve it, I'm suggesting to either coopt the existing async log force API... > There are already enough subtle changes being made to core code and > algorithms that mixing them with unrelated high level external > behavioural changes that it greatly increases the risk of unexpected > regressions in the patchset. The log force are paths are used in > data integrity paths, so I want to limit the scope of behavioural > change to just the AIL where the log force has no data integrity > requirement associcated with it. > ... or if we really want a special async log force just for xfsaild (why is still not clear to me), then tie it to an XFS_LOG_REALLY_ASYNC flag or some such, pass that to the existing log force call, and document the purpose/behavior of the new mode in detail. That at least won't require a developer to wonder what !(flags & XFS_LOG_SYNC) happens to mean depending on the particular log force function. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-03-04 13:13 ` Brian Foster @ 2021-03-04 22:48 ` Dave Chinner 2021-03-05 14:58 ` Brian Foster 0 siblings, 1 reply; 33+ messages in thread From: Dave Chinner @ 2021-03-04 22:48 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Thu, Mar 04, 2021 at 08:13:07AM -0500, Brian Foster wrote: > On Thu, Mar 04, 2021 at 12:59:33PM +1100, Dave Chinner wrote: > > On Wed, Mar 03, 2021 at 12:32:39PM -0500, Brian Foster wrote: > > > On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote: > > > > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote: > > > > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote: > > > > > > * xlog_cil_push() handles racing pushes for the same sequence, > > > > > > * so no need to deal with it here. > > > > > > */ > > > > > > restart: > > > > > > - xlog_cil_push_now(log, sequence); > > > > > > + xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC); > > > > > > + if (!(flags & XFS_LOG_SYNC)) > > > > > > + return commit_lsn; > > > > > > > > > > Hm, so now we have sync and async log force and sync and async CIL push. > > > > > A log force always requires a sync CIL push and commit record submit; > > > > > the difference is simply whether or not we wait on commit record I/O > > > > > completion. The sync CIL push drains the CIL of log items but does not > > > > > switch out the commit record iclog, while the async CIL push executes in > > > > > the workqueue context so the drain is async, but it does switch out the > > > > > commit record iclog before it completes. IOW, the async CIL push is > > > > > basically a "more async" async log force. > > > > > > > > Yes. > > > > > > > > > I can see the need for the behavior of the async CIL push here, but that > > > > > leaves a mess of interfaces and behavior matrix. 
Is there any reason we > > > > > couldn't just make async log forces unconditionally behave equivalent to > > > > > the async CIL push as defined by this patch? There's only a handful of > > > > > existing users and I don't see any obvious reason why they might care > > > > > whether the underlying CIL push is synchronous or not... > > > > > > > > I'm not touching the rest of the log force code in this series - it > > > > is out of scope of this bug fix and the rest of the series that it > > > > is part of. > > > > > > > > > > But you already have altered the log force code by changing > > > xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq() > > > behaves. > > > > The behaviour of the function when called from xfs_log_force*() > > should be unchanged. Can you be specific about exactly what > > behaviour did I change for non-synchronous xfs_log_force*() callers > > so I can fix it? I have not intended to change it at all, so > > whatever you are refering is an issue I need to fix... > > > > xfs_log_force_seq() passes flags from the caller to xlog_cil_force_seq() > (whereas this patch presumably wants it to pass XFS_LOG_SYNC > unconditionally). IOW, xfs_log_force(mp, 0) behavior is different from > xfs_log_force_seq(mp, seq, 0, ...). Fixed. > > > So not only does this patch expose subsystem internals to > > > external layers and create more log forcing interfaces/behaviors to > > > > Sorry, I don't follow. What "subsystem internals" are being exposed > > and what external layer are they being exposed to? > > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and > > > > changing things like log force behaviour and implementation is for > > > > a future series. > > > > > > > > > > TBH I think this patch is kind of a mess on its own. I think the > > > mechanism it wants to provide is sane, but I've not even got to the > > > point of reviewing _that_ yet because of the seeming dismissal of higher > > > level feedback. 
I'd rather not go around in circles on this so I'll just > > > offer my summarized feedback to this patch... > > > > I'm not dismissing review nor am I saying the API cannot or should > > not be improved. I'm simply telling you that I think the additional > > changes you are proposing are outside the scope of the problem I am > > addressing here. I already plan to rework the log force API (and > > others) but doing so it not something that this patchset needs to > > address, or indeed should address. > > > > I'm not proposing additional changes nor to rework the log force API. > I'm pointing out that I find this implementation to be extremely and > unnecessarily confusing. And that's just fine - not everyone has to understand all of the code all of the time. That's why we have a team.... > To improve it, I'm suggesting to either coopt > the existing async log force API... And I'm telling you that this is *future work*. As the person actually doing the work, that's my decision and you need to accept that. I understand your concerns and have a plan to address them - you just have to accept that the plan to address them isn't going to be exactly to your liking. > > There are already enough subtle changes being made to core code and > > algorithms that mixing them with unrelated high level external > > behavioural changes that it greatly increases the risk of unexpected > > regressions in the patchset. The log force are paths are used in > > data integrity paths, so I want to limit the scope of behavioural > > change to just the AIL where the log force has no data integrity > > requirement associcated with it. > > > > ... or if we really want a special async log force just for xfsaild (why > is still not clear to me), then tie it to an XFS_LOG_REALLY_ASYNC flag > or some such, pass that to the existing log force call, and document the > purpose/behavior of the new mode in detail. 
That at least won't require > a developer to wonder what !(flags & XFS_LOG_SYNC) happens to mean > depending on the particular log force function. No. That's just obfuscation. And it is extremely likely that the first thing Christoph will ask me to do with a one-shot flag that just calls out to a single function is to get rid of the flag and call the function directly. This will get cleaned up. Just not in this patchset. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-03-04 22:48 ` Dave Chinner @ 2021-03-05 14:58 ` Brian Foster 2021-03-09 0:44 ` Dave Chinner 0 siblings, 1 reply; 33+ messages in thread From: Brian Foster @ 2021-03-05 14:58 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote: > On Thu, Mar 04, 2021 at 08:13:07AM -0500, Brian Foster wrote: > > On Thu, Mar 04, 2021 at 12:59:33PM +1100, Dave Chinner wrote: > > > On Wed, Mar 03, 2021 at 12:32:39PM -0500, Brian Foster wrote: > > > > On Wed, Mar 03, 2021 at 11:57:52AM +1100, Dave Chinner wrote: > > > > > On Tue, Mar 02, 2021 at 04:44:12PM -0500, Brian Foster wrote: > > > > > > On Thu, Feb 25, 2021 at 10:26:00AM +1100, Dave Chinner wrote: > > > > > > > * xlog_cil_push() handles racing pushes for the same sequence, > > > > > > > * so no need to deal with it here. > > > > > > > */ > > > > > > > restart: > > > > > > > - xlog_cil_push_now(log, sequence); > > > > > > > + xlog_cil_push_now(log, sequence, flags & XFS_LOG_SYNC); > > > > > > > + if (!(flags & XFS_LOG_SYNC)) > > > > > > > + return commit_lsn; > > > > > > > > > > > > Hm, so now we have sync and async log force and sync and async CIL push. > > > > > > A log force always requires a sync CIL push and commit record submit; > > > > > > the difference is simply whether or not we wait on commit record I/O > > > > > > completion. The sync CIL push drains the CIL of log items but does not > > > > > > switch out the commit record iclog, while the async CIL push executes in > > > > > > the workqueue context so the drain is async, but it does switch out the > > > > > > commit record iclog before it completes. IOW, the async CIL push is > > > > > > basically a "more async" async log force. > > > > > > > > > > Yes. > > > > > > > > > > > I can see the need for the behavior of the async CIL push here, but that > > > > > > leaves a mess of interfaces and behavior matrix. 
Is there any reason we > > > > > > couldn't just make async log forces unconditionally behave equivalent to > > > > > > the async CIL push as defined by this patch? There's only a handful of > > > > > > existing users and I don't see any obvious reason why they might care > > > > > > whether the underlying CIL push is synchronous or not... > > > > > > > > > > I'm not touching the rest of the log force code in this series - it > > > > > is out of scope of this bug fix and the rest of the series that it > > > > > is part of. > > > > > > > > > > > > > But you already have altered the log force code by changing > > > > xlog_cil_force_seq(), which implicitly changes how xfs_log_force_seq() > > > > behaves. > > > > > > The behaviour of the function when called from xfs_log_force*() > > > should be unchanged. Can you be specific about exactly what > > > behaviour did I change for non-synchronous xfs_log_force*() callers > > > so I can fix it? I have not intended to change it at all, so > > > whatever you are refering is an issue I need to fix... > > > > > > > xfs_log_force_seq() passes flags from the caller to xlog_cil_force_seq() > > (whereas this patch presumably wants it to pass XFS_LOG_SYNC > > unconditionally). IOW, xfs_log_force(mp, 0) behavior is different from > > xfs_log_force_seq(mp, seq, 0, ...). > > Fixed. > > > > > So not only does this patch expose subsystem internals to > > > > external layers and create more log forcing interfaces/behaviors to > > > > > > Sorry, I don't follow. What "subsystem internals" are being exposed > > > and what external layer are they being exposed to? > > > > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and > > > > > changing things like log force behaviour and implementation is for > > > > > a future series. > > > > > > > > > > > > > TBH I think this patch is kind of a mess on its own. 
I think the > > > > mechanism it wants to provide is sane, but I've not even got to the > > > > point of reviewing _that_ yet because of the seeming dismissal of higher > > > > level feedback. I'd rather not go around in circles on this so I'll just > > > > offer my summarized feedback to this patch... > > > > > > I'm not dismissing review nor am I saying the API cannot or should > > > not be improved. I'm simply telling you that I think the additional > > > changes you are proposing are outside the scope of the problem I am > > > addressing here. I already plan to rework the log force API (and > > > others) but doing so it not something that this patchset needs to > > > address, or indeed should address. > > > > > > > I'm not proposing additional changes nor to rework the log force API. > > I'm pointing out that I find this implementation to be extremely and > > unnecessarily confusing. > > And that's just fine - not everyone has to understand all of the > code all of the time. That's why we have a team.... > > > To improve it, I'm suggesting to either coopt > > the existing async log force API... > > And I'm telling you that this is *future work*. As the person > actually doing the work, that's my decision and you need to accept > that. I understand your concerns and have a plan to address them - > you just have to accept that the plan to address them isn't going to > be exactly to your liking. > As a reviewer, I am pointing out that I object to how this patch is implemented and offered several fairly simple suggestions on how to address my concerns. Those suggestions have been rejected pretty much out of hand on the basis of the existence of some future plans. Future rework plans do not justify or change my view of this patch and the mess it adds (for other developers, for historical context and downstreams, etc.). 
Therefore, if the only option you're willing to consider is "make this mess now and clean it up later in some broader rework," then I'd rather we just defer this patch until after that rework is available and avoid the mess that way. Brian > > > There are already enough subtle changes being made to core code and > > > algorithms that mixing them with unrelated high level external > > > behavioural changes that it greatly increases the risk of unexpected > > > regressions in the patchset. The log force are paths are used in > > > data integrity paths, so I want to limit the scope of behavioural > > > change to just the AIL where the log force has no data integrity > > > requirement associcated with it. > > > > > > > ... or if we really want a special async log force just for xfsaild (why > > is still not clear to me), then tie it to an XFS_LOG_REALLY_ASYNC flag > > or some such, pass that to the existing log force call, and document the > > purpose/behavior of the new mode in detail. That at least won't require > > a developer to wonder what !(flags & XFS_LOG_SYNC) happens to mean > > depending on the particular log force function. > > No. That's just obfuscation. And it is extremely likely that the > first thing Christoph will ask me to do with one-shot flag that just > calls out to a single function is to get rid of the flag and call > the function directly. > > This will get cleaned up. Just not in this patchset. > > -Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-03-05 14:58 ` Brian Foster @ 2021-03-09 0:44 ` Dave Chinner 2021-03-09 4:35 ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ Darrick J. Wong 2021-03-10 14:49 ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Brian Foster 0 siblings, 2 replies; 33+ messages in thread From: Dave Chinner @ 2021-03-09 0:44 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote: > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote: > > > > > So not only does this patch expose subsystem internals to > > > > > external layers and create more log forcing interfaces/behaviors to > > > > > > > > Sorry, I don't follow. What "subsystem internals" are being exposed > > > > and what external layer are they being exposed to? > > > > > > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and > > > > > > changing things like log force behaviour and implementation is for > > > > > > a future series. > > > > > > > > > > > > > > > > TBH I think this patch is kind of a mess on its own. I think the > > > > > mechanism it wants to provide is sane, but I've not even got to the > > > > > point of reviewing _that_ yet because of the seeming dismissal of higher > > > > > level feedback. I'd rather not go around in circles on this so I'll just > > > > > offer my summarized feedback to this patch... > > > > > > > > I'm not dismissing review nor am I saying the API cannot or should > > > > not be improved. I'm simply telling you that I think the additional > > > > changes you are proposing are outside the scope of the problem I am > > > > addressing here. I already plan to rework the log force API (and > > > > others) but doing so it not something that this patchset needs to > > > > address, or indeed should address. > > > > > > > > > > I'm not proposing additional changes nor to rework the log force API. 
> > > I'm pointing out that I find this implementation to be extremely and > > > unnecessarily confusing. > > > > And that's just fine - not everyone has to understand all of the > > code all of the time. That's why we have a team.... > > > > > To improve it, I'm suggesting to either coopt > > > the existing async log force API... > > > > And I'm telling you that this is *future work*. As the person > > actually doing the work, that's my decision and you need to accept > > that. I understand your concerns and have a plan to address them - > > you just have to accept that the plan to address them isn't going to > > be exactly to your liking. > > > > As a reviewer, I am pointing out that I object to how this patch is > implemented and offered several fairly simple suggestions on how to > address my concerns. Those suggestions have been rejected pretty much > out of hand on the basis of the existence of some future plans. We're allowed to disagree on the best approach. But to find a way forward means that both the submitter and the reviewer need to compromise. I've fixed up the obvious warts and bugs you've pointed out, and agreed that it needs cleaning up and have committed to cleaning it up in the near future. > Future rework plans do not justify or change my view of this patch > and the mess it adds (for other developers, for historical context > and downstreams, etc.). Therefore, if the only option you're > willing to consider is "make this mess now and clean it up later > in some broader rework," then I'd rather we just defer this patch > until after that rework is available and avoid the mess that way. So you won't review it until I have 100 outstanding patches in this series and it's completely and utterly unreviewable? Then you'll ask me to split it up and re-order it into digestible chunks, and so we'll go back to having to merge this before any of the API rework that depends on the mechanism that this patch introduces.
I'm not going to play that "now jump through this hoop" game. We add flags for on-off behaviours in internal functions -all the time-. If this makes the interface so complex and confusing that you don't understand it, then the interface was already too complex and confusing. And fixing that is *not in the scope of this patchset*. Demanding that code be made perfect before it can be merged is really not very helpful. Especially when there are already plans to rework the API but that rework is dependent on a bunch of other changes that need to be done first. iclogs are something that need to be moved behind the CIL, not sit in front of the CIL. The CIL abstracts the journal and writing to the journal completely away from the transaction subsystem, yet the log force code punches straight through that abstraction and walks iclogs directly. The whole log force implementation needs to change, and I plan for the API that wraps the log forces to get reworked at that time. For example, if we want to direct map storage for log writes, then iclog-based log force synchronisation needs to go away because we don't need iclogs for buffering journal writes. Hence the log force code should interface only with the CIL, and only the CIL should manage whatever mechanism it is using to write to stable storage. The code is currently the way it is because the CIL, when first implemented, had to co-exist with the old way of writing to the log.
This patch is actually part of that pre-work to get the mechanisms that the log force rework will rely on. I have very good reasons for pushing back against your suggestions, Brian. Your suggestions have merit but this patch is not the time to be making the changes you suggest. Code does not need to be perfect to be merged, nor does the entire larger change they will be part of need to be complete and 100% tested and reviewed before preparation and infrastructure patches can be merged. This is how we've done big changes in the past - they've been staged across multiple kernel cycles - and not everything needs to be done in the first patchset of a larger body of work.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ 2021-03-09 0:44 ` Dave Chinner @ 2021-03-09 4:35 ` Darrick J. Wong 2021-03-10 2:10 ` Brian Foster ` (2 more replies) 2021-03-10 14:49 ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Brian Foster 1 sibling, 3 replies; 33+ messages in thread From: Darrick J. Wong @ 2021-03-09 4:35 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, linux-xfs On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote: > On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote: > > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote: > > > > > > So not only does this patch expose subsystem internals to > > > > > > external layers and create more log forcing interfaces/behaviors to > > > > > > > > > > Sorry, I don't follow. What "subsystem internals" are being exposed > > > > > and what external layer are they being exposed to? > > > > > > > > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and > > > > > > > changing things like log force behaviour and implementation is for > > > > > > > a future series. > > > > > > > > > > > > > > > > > > > TBH I think this patch is kind of a mess on its own. I think the > > > > > > mechanism it wants to provide is sane, but I've not even got to the > > > > > > point of reviewing _that_ yet because of the seeming dismissal of higher > > > > > > level feedback. I'd rather not go around in circles on this so I'll just > > > > > > offer my summarized feedback to this patch... > > > > > > > > > > I'm not dismissing review nor am I saying the API cannot or should > > > > > not be improved. I'm simply telling you that I think the additional > > > > > changes you are proposing are outside the scope of the problem I am > > > > > addressing here. I already plan to rework the log force API (and > > > > > others) but doing so it not something that this patchset needs to > > > > > address, or indeed should address. 
> > > > > > > > > > > > > I'm not proposing additional changes nor to rework the log force API. > > > > I'm pointing out that I find this implementation to be extremely and > > > > unnecessarily confusing. > > > > > > And that's just fine - not everyone has to understand all of the > > > code all of the time. That's why we have a team.... > > > > > > > To improve it, I'm suggesting to either coopt > > > > the existing async log force API... > > > > > > And I'm telling you that this is *future work*. As the person > > > actually doing the work, that's my decision and you need to accept > > > that. I understand your concerns and have a plan to address them - > > > you just have to accept that the plan to address them isn't going to > > > be exactly to your liking. > > > > > > > As a reviewer, I am pointing out that I object to how this patch is > > implemented and offered several fairly simple suggestions on how to > > address my concerns. Those suggestions have been rejected pretty much > > out of hand on the basis of the existence of some future plans. > > We're allowed to disagree on the best approach. But to find a way > forward means that both the summitter and the reviewer need to > compromise. I've fixed up the obvious warts and bugs you've pointed > out, and agreed that it needs cleaning up and have committed to > cleaning it up in the near future. > > > Future rework plans do not justify or change my view of this patch > > and the mess it adds (for other developers, for historical context > > and downstreams, etc.). Therefore, if the only option you're > > willing to consider is "make this mess now and clean it up later > > in some broader rework," then I'd rather we just defer this patch > > until after that rework is available and avoid the mess that way. > > So you won't review it until I have 100 outstanding patches in this > series and it's completely and utterly unreviewable? Already unreviewable at 45, and I've only gotten through 2/3 of it. 
> Then you'll ask me to split it up and re-order it into digestable > chunks, and so we'll go back to having to merge this because any of > the API rework that depends on the mechanism that this patch > introduces. Here's something I haven't previously shared with all of you: Last cycle when we were going around and around on the ENOSPC/EDQUOT retry loop patches (which exploded from 13 to 41 patches) it was /very/ stressful to have to rework this and that part every day and a half for almost three weeks. When I get a patchset ready for submission, I export the patches from my development branch into a topic branch based off the latest -rc. Every time anyone makes a suggestion, I update the topic branch and send that to the fstests "cloud" to see if it broke anything. If it succeeds, I push it to k.org and resubmit. The part that's less obvious is the fact that rebasing is /not/ that simple. Next I integrate the changes into my development tree and rebase everything that comes after /that/ to make sure it doesn't interfere with my longer term development goals. Sometimes this is easy, sometimes this takes an entire day to work out the warts. Then I send that to the fstests "cloud". This rebase is particularly painful because every change that everyone makes to inode initialization collides with a rework of inode initialization that I've been working on in preparation for metadata directory trees. The part that's even /less/ obvious than that is that once the development tree tests ok, then I do the same to my xfsprogs dev tree to make sure that nothing broke. Frequently there are ABI or cognitive mismatches between kernel and userspace such that the build breaks, and then I have to patch the two kernel trees and re-run everything. So, that's really stressful for me because a minor tweak to an interface can result in an enormous amount of work. 
And I reject the argument that I could just rebase less frequently -- Dave does that, which means that any time he hears that one of his topic branches is being asked about, he has to spend weeks rebasing the whole mess to the latest upstream. Maybe that works for him, but for me, I would hate that even more than doing a daily rebase. Add to that the fact that vger has been a total delivery sh*tshow all year. Now in addition to feeling disconnected from my family and friends, I also feel disconnected from work people too. This really did nearly push me over the edge three weeks ago. Please remember, one of the big advantages of our open development processes is that we /do/ accept code with warty (but functional) interfaces now, and we can clean them up later. This is (IMHO) a good stress-reduction tactic, because each of us (ideally) should concentrate on getting the core algorithms right, and not focusing on rebasing code and smoothing over the same d*** merge conflicts over and over. Yes, it's true that people think that a maintainer's only real power is to say 'no' in the hopes of forcing developers to fix everything now because they can't trust that a dev will ever come back with the promised updates, but I reject that 110%. I'm not going anywhere, and I /do/ trust that when the rest of you say that you'll be back with wart remover, you will. > I'm not going to play that "now jump through this hoop" game. We > add flags for on-off behaviours in internal functions -all the > time-. If this makes the interface so complex and confusing that you > don't understand it, then the interface was already too complex and > confusing. And fixing that is *not in the scope of this patchset*. > > Demanding that code be made perfect before it can be merged is > really not very helpful. Especially when there are already plans to > rework the API but that rework is dependent on a bunch of other > changes than need to be done first. 
> > iclogs are something that need to be moved behind the CIL, not sit > in front of CIL. The CIL abstracts the journal and writing to the > journal completely away from the transaction subsystem, yet the log > force code punches straight through that abstraction and walks > iclogs directly. The whole log force implementation needs to change, > and I plan for the API that wraps the log forces to get reworked at > that time. So here's what I want to know: Do Dave's changes to the log force APIs introduce broken behavior? If the interfaces are so confusing that /none/ of us understand it, can we introduce the appropriate wrappers and documentation so that the rest of us plugging away at the rest of the system can only call it the supported ways to achieve any of the supported outcomes? I'm willing to accept a bunch of documentation and "trivial" wrappers for the rest of us as a shoofly to enable the rest of the xfs developers to keep moving around a messy slow-moving log restructuring without falling into a work pit. However, it's been difficult for me to check that without being able to reference a branch to see that at least the end result looks sane. That was immensely helpful for reviewing Allison's deferred xattrs series. (TLDR: git branch plz) The other way to ease my fears, of course, would be to submit a ton of fstests to examine the log behavior for correctness, but that's difficult to pull off when the control surface is the userspace ABI. > For example, if we want to direct map storage for log writes, then > iclog-based log force synchronisation needs to go away because we > don't need iclogs for buffering journal writes. Hence the log foce > code should interface only with the CIL, and only the CIL should > manage whatever mechanism it is using to write to stable storage. > The code is currently the way it is because the CIL, when first > implemented, had to co-exist with the old way of writing to the log. 
> We haven't used that old way for a decade now and we have very > different storage performance characteristics these days, so it's > about time we got rid of the mechanism designed to be optimal for > spinning disks and actually integrated the CIL and the log > efficiently. > > There are a *lot* of steps to do this, and reworking the log force > implementation and API is part of that. But reworking that API is > premature because we haven't done all the necessary pre-work in > place to make such a change yet. This patch is actually part of that > pre-work to get the mechanisms that the log force rework will rely > on. > > I have very good reasons for pushing back against your suggestions, > Brian. Your suggestions have merit but this patch is not the time to > be making the changes you suggest. Code does not need to be perfect > to be merged, nor does the entire larger change they will be part of > need to be complete and 100% tested and reviewed before preparation > and infrastructure patches can be merged. This is how we've done big > changes in the past - they've been staged across multiple kernel > cycles - and not everything needs to be done in the first > patchset of a larger body of work.... You mean there's even more beyond the 45 already on the list? /groan/ --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ 2021-03-09 4:35 ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ Darrick J. Wong @ 2021-03-10 2:10 ` Brian Foster 2021-03-10 22:00 ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Dave Chinner 2021-03-10 15:13 ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing\ Brian Foster 2021-03-11 12:41 ` Christoph Hellwig 2 siblings, 1 reply; 33+ messages in thread From: Brian Foster @ 2021-03-10 2:10 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs On Mon, Mar 08, 2021 at 08:35:59PM -0800, Darrick J. Wong wrote: > On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote: > > On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote: > > > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote: > > > > > > > So not only does this patch expose subsystem internals to > > > > > > > external layers and create more log forcing interfaces/behaviors to > > > > > > > > > > > > Sorry, I don't follow. What "subsystem internals" are being exposed > > > > > > and what external layer are they being exposed to? > > > > > > > > > > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and > > > > > > > > changing things like log force behaviour and implementation is for > > > > > > > > a future series. > > > > > > > > > > > > > > > > > > > > > > TBH I think this patch is kind of a mess on its own. I think the > > > > > > > mechanism it wants to provide is sane, but I've not even got to the > > > > > > > point of reviewing _that_ yet because of the seeming dismissal of higher > > > > > > > level feedback. I'd rather not go around in circles on this so I'll just > > > > > > > offer my summarized feedback to this patch... > > > > > > > > > > > > I'm not dismissing review nor am I saying the API cannot or should > > > > > > not be improved. 
I'm simply telling you that I think the additional > > > > > > changes you are proposing are outside the scope of the problem I am > > > > > > addressing here. I already plan to rework the log force API (and > > > > > > others) but doing so it not something that this patchset needs to > > > > > > address, or indeed should address. > > > > > > > > > > > > > > > > I'm not proposing additional changes nor to rework the log force API. > > > > > I'm pointing out that I find this implementation to be extremely and > > > > > unnecessarily confusing. > > > > > > > > And that's just fine - not everyone has to understand all of the > > > > code all of the time. That's why we have a team.... > > > > > > > > > To improve it, I'm suggesting to either coopt > > > > > the existing async log force API... > > > > > > > > And I'm telling you that this is *future work*. As the person > > > > actually doing the work, that's my decision and you need to accept > > > > that. I understand your concerns and have a plan to address them - > > > > you just have to accept that the plan to address them isn't going to > > > > be exactly to your liking. > > > > > > > > > > As a reviewer, I am pointing out that I object to how this patch is > > > implemented and offered several fairly simple suggestions on how to > > > address my concerns. Those suggestions have been rejected pretty much > > > out of hand on the basis of the existence of some future plans. > > > > We're allowed to disagree on the best approach. But to find a way > > forward means that both the summitter and the reviewer need to > > compromise. I've fixed up the obvious warts and bugs you've pointed > > out, and agreed that it needs cleaning up and have committed to > > cleaning it up in the near future. > > > > > Future rework plans do not justify or change my view of this patch > > > and the mess it adds (for other developers, for historical context > > > and downstreams, etc.). 
Therefore, if the only option you're > > > willing to consider is "make this mess now and clean it up later > > > in some broader rework," then I'd rather we just defer this patch > > > until after that rework is available and avoid the mess that way. > > > > So you won't review it until I have 100 outstanding patches in this > > series and it's completely and utterly unreviewable? > > Already unreviewable at 45, and I've only gotten through 2/3 of it. > > > Then you'll ask me to split it up and re-order it into digestable > > chunks, and so we'll go back to having to merge this because any of > > the API rework that depends on the mechanism that this patch > > introduces. > > Here's something I haven't previously shared with all of you: Last cycle > when we were going around and around on the ENOSPC/EDQUOT retry loop > patches (which exploded from 13 to 41 patches) it was /very/ stressful > to have to rework this and that part every day and a half for almost > three weeks. > > When I get a patchset ready for submission, I export the patches from my > development branch into a topic branch based off the latest -rc. Every > time anyone makes a suggestion, I update the topic branch and send that > to the fstests "cloud" to see if it broke anything. If it succeeds, I > push it to k.org and resubmit. > > The part that's less obvious is the fact that rebasing is /not/ that > simple. Next I integrate the changes into my development tree and > rebase everything that comes after /that/ to make sure it doesn't > interfere with my longer term development goals. Sometimes this is > easy, sometimes this takes an entire day to work out the warts. > Then I send that to the fstests "cloud". This rebase is particularly > painful because every change that everyone makes to inode > initialization collides with a rework of inode initialization that I've > been working on in preparation for metadata directory trees. 
> > The part that's even /less/ obvious than that is that once the > development tree tests ok, then I do the same to my xfsprogs dev tree to > make sure that nothing broke. Frequently there are ABI or cognitive > mismatches between kernel and userspace such that the build breaks, and > then I have to patch the two kernel trees and re-run everything. > > So, that's really stressful for me because a minor tweak to an interface > can result in an enormous amount of work. And I reject the argument > that I could just rebase less frequently -- Dave does that, which means > that any time he hears that one of his topic branches is being asked > about, he has to spend weeks rebasing the whole mess to the latest > upstream. Maybe that works for him, but for me, I would hate that even > more than doing a daily rebase. > > Add to that the fact that vger has been a total delivery sh*tshow all > year. Now in addition to feeling disconnected from my family and > friends, I also feel disconnected from work people too. This really did > nearly push me over the edge three weeks ago. > > Please remember, one of the big advantages of our open development > processes is that we /do/ accept code with warty (but functional) > interfaces now, and we can clean them up later. This is (IMHO) a good > stress-reduction tactic, because each of us (ideally) should concentrate > on getting the core algorithms right, and not focusing on rebasing code > and smoothing over the same d*** merge conflicts over and over. > > Yes, it's true that people think that a maintainer's only real power is > to say 'no' in the hopes of forcing developers to fix everything now > because they can't trust that a dev will ever come back with the > promised updates, but I reject that 110%. I'm not going anywhere, and I > /do/ trust that when the rest of you say that you'll be back with wart > remover, you will. > > > I'm not going to play that "now jump through this hoop" game. 
We > > add flags for on-off behaviours in internal functions -all the > > time-. If this makes the interface so complex and confusing that you > > don't understand it, then the interface was already too complex and > > confusing. And fixing that is *not in the scope of this patchset*. > > > > Demanding that code be made perfect before it can be merged is > > really not very helpful. Especially when there are already plans to > > rework the API but that rework is dependent on a bunch of other > > changes that need to be done first. > > > > iclogs are something that need to be moved behind the CIL, not sit > > in front of CIL. The CIL abstracts the journal and writing to the > > journal completely away from the transaction subsystem, yet the log > > force code punches straight through that abstraction and walks > > iclogs directly. The whole log force implementation needs to change, > > and I plan for the API that wraps the log forces to get reworked at > > that time. > > So here's what I want to know: Do Dave's changes to the log force APIs > introduce broken behavior? If the interfaces are so confusing that > /none/ of us understand it, can we introduce the appropriate wrappers > and documentation so that the rest of us plugging away at the rest of > the system can only call it the supported ways to achieve any of the > supported outcomes? > It looks to me that the changes to xlog_cil_force_seq() could essentially be replaced with something like: /* * Behold the magical async CIL force * ... <explain what this does> ... */ static inline void xfs_cil_force_async( struct xlog *log) { struct xfs_cil *cil = log->l_cilp; /* true for magic async CIL force */ xlog_cil_push_now(log, cil->xc_current_sequence, true); } If so, we wouldn't have to hack up xlog_cil_force_seq() with a flags parameter that provides inconsistent async semantics (and thus show mercy on the log force call stack above it) and otherwise only serves to bypass 95% of that function anyways. 
I also think that inverting the cil push bool param (and calling it async_push or some such) is a bit more clear because it separates it from XFS_LOG_SYNC (or !XFS_LOG_SYNC) semantics and simultaneously maps directly to the underlying ->xc_push_async control. Brian > I'm willing to accept a bunch of documentation and "trivial" wrappers > for the rest of us as a shoofly to enable the rest of the xfs developers > to keep moving around a messy slow-moving log restructuring without > falling into a work pit. > > However, it's been difficult for me to check that without being able to > reference a branch to see that at least the end result looks sane. That > was immensely helpful for reviewing Allison's deferred xattrs series. > > (TLDR: git branch plz) > > The other way to ease my fears, of course, would be to submit a ton of > fstests to examine the log behavior for correctness, but that's > difficult to pull off when the control surface is the userspace ABI. > > > For example, if we want to direct map storage for log writes, then > > iclog-based log force synchronisation needs to go away because we > > don't need iclogs for buffering journal writes. Hence the log force > > code should interface only with the CIL, and only the CIL should > > manage whatever mechanism it is using to write to stable storage. > > The code is currently the way it is because the CIL, when first > > implemented, had to co-exist with the old way of writing to the log. > > We haven't used that old way for a decade now and we have very > > different storage performance characteristics these days, so it's > > about time we got rid of the mechanism designed to be optimal for > > spinning disks and actually integrated the CIL and the log > > efficiently. > > > > There are a *lot* of steps to do this, and reworking the log force > > implementation and API is part of that. 
But reworking that API is > > premature because we haven't done all the necessary pre-work in > > place to make such a change yet. This patch is actually part of that > > pre-work to get the mechanisms that the log force rework will rely > > on. > > > > I have very good reasons for pushing back against your suggestions, > > Brian. Your suggestions have merit but this patch is not the time to > > be making the changes you suggest. Code does not need to be perfect > > to be merged, nor does the entire larger change they will be part of > > need to be complete and 100% tested and reviewed before preparation > > and infrastructure patches can be merged. This is how we've done big > > changes in the past - they've been staged across multiple kernel > > cycles - and not everything needs to be done in the first > > patchset of a larger body of work.... > > You mean there's even more beyond the 45 already on the list? /groan/ > > --D > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-03-10 2:10 ` Brian Foster @ 2021-03-10 22:00 ` Dave Chinner 0 siblings, 0 replies; 33+ messages in thread From: Dave Chinner @ 2021-03-10 22:00 UTC (permalink / raw) To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs On Tue, Mar 09, 2021 at 09:10:47PM -0500, Brian Foster wrote: > On Mon, Mar 08, 2021 at 08:35:59PM -0800, Darrick J. Wong wrote: > > On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote: > > > On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote: > > > I'm not going to play that "now jump through this hoop" game. We > > > add flags for on-off behaviours in internal functions -all the > > > time-. If this makes the interface so complex and confusing that you > > > don't understand it, then the interface was already too complex and > > > confusing. And fixing that is *not in the scope of this patchset*. > > > > > > Demanding that code be made perfect before it can be merged is > > > really not very helpful. Especially when there are already plans to > > > rework the API but that rework is dependent on a bunch of other > > > changes that need to be done first. > > > > > > iclogs are something that need to be moved behind the CIL, not sit > > > in front of CIL. The CIL abstracts the journal and writing to the > > > journal completely away from the transaction subsystem, yet the log > > > force code punches straight through that abstraction and walks > > > iclogs directly. The whole log force implementation needs to change, > > > and I plan for the API that wraps the log forces to get reworked at > > > that time. > > > > So here's what I want to know: Do Dave's changes to the log force APIs > > introduce broken behavior? 
If the interfaces are so confusing that > > /none/ of us understand it, can we introduce the appropriate wrappers > > and documentation so that the rest of us plugging away at the rest of > > the system can only call it the supported ways to achieve any of the > > supported outcomes? > > > > It looks to me that the changes to xlog_cil_force_seq() could > essentially be replaced with something like: > > /* > * Behold the magical async CIL force > * ... <explain what this does> ... > */ > static inline void > xfs_cil_force_async( > struct xlog *log) > { > struct xfs_cil *cil = log->l_cilp; > > /* true for magic async CIL force */ > xlog_cil_push_now(log, cil->xc_current_sequence, true); > } Oh, is that what you are asking for? You were talking about changing the API for all the callers that didn't use XFS_LOG_SYNC to make them all async, not about a one-off, single use wrapper for the AIL. If all you want is a single line wrapper function, then don't suggest that the whole API should be reworked and callers changed to use a new API. *Ask for a one-line wrapper to be added*. This wrapper will still need to go away in the near future, but at least it doesn't involve changing lots of unrelated code... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-03-09 4:35 ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Darrick J. Wong 2021-03-10 2:10 ` Brian Foster @ 2021-03-10 15:13 ` Brian Foster 2021-03-11 12:41 ` Christoph Hellwig 2 siblings, 0 replies; 33+ messages in thread From: Brian Foster @ 2021-03-10 15:13 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs (Replying separately to the logistical bits...) On Mon, Mar 08, 2021 at 08:35:59PM -0800, Darrick J. Wong wrote: > On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote: > > On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote: > > > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote: ... > > > > > > As a reviewer, I am pointing out that I object to how this patch is > > > implemented and offered several fairly simple suggestions on how to > > > address my concerns. Those suggestions have been rejected pretty much > > > out of hand on the basis of the existence of some future plans. > > > > We're allowed to disagree on the best approach. But to find a way > > forward means that both the submitter and the reviewer need to > > compromise. I've fixed up the obvious warts and bugs you've pointed > > out, and agreed that it needs cleaning up and have committed to > > cleaning it up in the near future. > > > > > Future rework plans do not justify or change my view of this patch > > > and the mess it adds (for other developers, for historical context > > > and downstreams, etc.). Therefore, if the only option you're > > > willing to consider is "make this mess now and clean it up later > > > in some broader rework," then I'd rather we just defer this patch > > > until after that rework is available and avoid the mess that way. > > > > So you won't review it until I have 100 outstanding patches in this > > series and it's completely and utterly unreviewable? > > Already unreviewable at 45, and I've only gotten through 2/3 of it. 
> > > Then you'll ask > > me to split it up and re-order it into digestible > > chunks, and so > > we'll go back to having to merge this because all of > > the API rework depends on the mechanism that this patch > > introduces. > > Here's something I haven't previously shared with all of you: Last cycle > when we were going around and around on the ENOSPC/EDQUOT retry loop > patches (which exploded from 13 to 41 patches) it was /very/ stressful > to have to rework this and that part every day and a half for almost > three weeks. > I sympathize. There was at least a week in there that I spent pretty much on nothing but review. :/ > When I get a patchset ready for submission, I export the patches from my > development branch into a topic branch based off the latest -rc. Every > time anyone makes a suggestion, I update the topic branch and send that > to the fstests "cloud" to see if it broke anything. If it succeeds, I > push it to k.org and resubmit. > > The part that's less obvious is the fact that rebasing is /not/ that > simple. Next I integrate the changes into my development tree and > rebase everything that comes after /that/ to make sure it doesn't > interfere with my longer term development goals. Sometimes this is > easy, sometimes this takes an entire day to work out the warts. > Then I send that to the fstests "cloud". This rebase is particularly > painful because every change that everyone makes to inode > initialization collides with a rework of inode initialization that I've > been working on in preparation for metadata directory trees. > > The part that's even /less/ obvious than that is that once the > development tree tests ok, then I do the same to my xfsprogs dev tree to > make sure that nothing broke. Frequently there are ABI or cognitive > mismatches between kernel and userspace such that the build breaks, and > then I have to patch the two kernel trees and re-run everything. 
> > So, that's really stressful for me because a minor tweak to an interface > can result in an enormous amount of work. And I reject the argument > that I could just rebase less frequently -- Dave does that, which means > that any time he hears that one of his topic branches is being asked > about, he has to spend weeks rebasing the whole mess to the latest > upstream. Maybe that works for him, but for me, I would hate that even > more than doing a daily rebase. > > Add to that the fact that vger has been a total delivery sh*tshow all > year. Now in addition to feeling disconnected from my family and > friends, I also feel disconnected from work people too. This really did > nearly push me over the edge three weeks ago. > Sorry. :( > Please remember, one of the big advantages of our open development > processes is that we /do/ accept code with warty (but functional) > interfaces now, and we can clean them up later. This is (IMHO) a good > stress-reduction tactic, because each of us (ideally) should concentrate > on getting the core algorithms right, and not focusing on rebasing code > and smoothing over the same d*** merge conflicts over and over. > I agree that we can and do accept code as such in certain cases, but not always or by default. Usually, only when it makes sense due to complexity, cross subsystem issues, etc. If the situation is the submitter has sent a large patch series at a point where it's already so unwieldy to change because of non-upstream reasons that some restricted form of review is required, then I think that's a bit unfair to the upstream community. Using the whole block reclaim/retry loop thing as an example (just because it was the most recent example of this kind of thing causing you grief), perhaps those foundational changes should have been sent to the list earlier before they became "too unwieldy to change?" 
I think it's perfectly reasonable to send lightly tested or compile tested only patches for initial thoughts, factoring feedback, etc., particularly if a change proposes to add retry loops to various transaction allocation callers. That's a fairly significant change in some core areas. ISTM that on one hand review is fairly open ended and it's not really up to the submitter to declare the scope of reviewer feedback (the submitter is then certainly free to change certain things or not, but the goal should be to cooperate toward some common ground). OTOH I suppose the open development model allows for a community to determine a standard for if/how they prefer to review patches. That said, I personally don't like the "accept ugly code now, fix it later over time" model by default because for one, I think it leads to a lower quality mainline. I also think there's an implied assumption in there that a reviewer wants to review endless series of cleanup patches after providing such feedback on early versions of functionality. I know if I take the time to review a 30+ patch series and had some non-negligible amount of aesthetic feedback, I'm not really looking to review another 20+ patches some time later just to clean up a mess that the first series might have (unnecessarily) created, particularly when the feedback has already been provided and so iterative review is more time efficient (I find review much more time consuming than development). I dunno. That's just my .02 I guess. I do agree with Dave's previous statements that not everything has to be 100% mutually agreeable. I generally try to accommodate that by incorporating suggestions into my own patches that I might not necessarily agree with, but don't feel strongly enough about, to save time and keep things moving. 
Conversely on the review side, I generally try to defer to other reviewers to see whether my particular position might be in the majority or minority, and go with that (That doesn't always mean I'm going to put reviewed-by tags on patches I really don't like. In some cases, I think reviewing for function/bugs and not nacking a patch is a form of compromise. ;). > Yes, it's true that people think that a maintainer's only real power is > to say 'no' in the hopes of forcing developers to fix everything now > because they can't trust that a dev will ever come back with the > promised updates, but I reject that 110%. I'm not going anywhere, and I > /do/ trust that when the rest of you say that you'll be back with wart > remover, you will. > I don't distrust that any XFS developers would make promised fixes as such. I don't really see that as a problem at all. I think the whole subsystem rework thing is pretty much a distraction from the original feedback. E.g., I think your wrapper idea (explored in my previous reply) is yet another reasonable solution that doesn't involve or require any sort of broader rework. Brian > > I'm not going to play that "now jump through this hoop" game. We > > add flags for on-off behaviours in internal functions -all the > > time-. If this makes the interface so complex and confusing that you > > don't understand it, then the interface was already too complex and > > confusing. And fixing that is *not in the scope of this patchset*. > > > > Demanding that code be made perfect before it can be merged is > > really not very helpful. Especially when there are already plans to > > rework the API but that rework is dependent on a bunch of other > > changes that need to be done first. > > > > iclogs are something that need to be moved behind the CIL, not sit > > in front of CIL. 
The CIL abstracts the journal and writing to the > > journal completely away from the transaction subsystem, yet the log > > force code punches straight through that abstraction and walks > > iclogs directly. The whole log force implementation needs to change, > > and I plan for the API that wraps the log forces to get reworked at > > that time. > > So here's what I want to know: Do Dave's changes to the log force APIs > introduce broken behavior? If the interfaces are so confusing that > /none/ of us understand it, can we introduce the appropriate wrappers > and documentation so that the rest of us plugging away at the rest of > the system can only call it the supported ways to achieve any of the > supported outcomes? > > I'm willing to accept a bunch of documentation and "trivial" wrappers > for the rest of us as a shoofly to enable the rest of the xfs developers > to keep moving around a messy slow-moving log restructuring without > falling into a work pit. > > However, it's been difficult for me to check that without being able to > reference a branch to see that at least the end result looks sane. That > was immensely helpful for reviewing Allison's deferred xattrs series. > > (TLDR: git branch plz) > > The other way to ease my fears, of course, would be to submit a ton of > fstests to examine the log behavior for correctness, but that's > difficult to pull off when the control surface is the userspace ABI. > > > For example, if we want to direct map storage for log writes, then > > iclog-based log force synchronisation needs to go away because we > > don't need iclogs for buffering journal writes. Hence the log force > > code should interface only with the CIL, and only the CIL should > > manage whatever mechanism it is using to write to stable storage. > > The code is currently the way it is because the CIL, when first > > implemented, had to co-exist with the old way of writing to the log. 
> > We haven't used that old way for a decade now and we have very > > different storage performance characteristics these days, so it's > > about time we got rid of the mechanism designed to be optimal for > > spinning disks and actually integrated the CIL and the log > > efficiently. > > > > There are a *lot* of steps to do this, and reworking the log force > > implementation and API is part of that. But reworking that API is > > premature because we haven't done all the necessary pre-work in > > place to make such a change yet. This patch is actually part of that > > pre-work to get the mechanisms that the log force rework will rely > > on. > > > > I have very good reasons for pushing back against your suggestions, > > Brian. Your suggestions have merit but this patch is not the time to > > be making the changes you suggest. Code does not need to be perfect > > to be merged, nor does the entire larger change they will be part of > > need to be complete and 100% tested and reviewed before preparation > > and infrastructure patches can be merged. This is how we've done big > > changes in the past - they've been staged across multiple kernel > > cycles - and not everything needs to be done in the first > > patchset of a larger body of work.... > > You mean there's even more beyond the 45 already on the list? /groan/ > > --D > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-03-09 4:35 ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Darrick J. Wong 2021-03-10 2:10 ` Brian Foster 2021-03-10 15:13 ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Brian Foster @ 2021-03-11 12:41 ` Christoph Hellwig 2 siblings, 0 replies; 33+ messages in thread From: Christoph Hellwig @ 2021-03-11 12:41 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, Brian Foster, linux-xfs On Mon, Mar 08, 2021 at 08:35:59PM -0800, Darrick J. Wong wrote: > > So you won't review it until I have 100 outstanding patches in this > > series and it's completely and utterly unreviewable? > > Already unreviewable at 45, and I've only gotten through 2/3 of it. Yes. For patches that don't just repetitively apply similar changes to a few places, about 20 patches is the max that is digestible. > Here's something I haven't previously shared with all of you: Last cycle > when we were going around and around on the ENOSPC/EDQUOT retry loop > patches (which exploded from 13 to 41 patches) it was /very/ stressful > to have to rework this and that part every day and a half for almost > three weeks. As someone part of the loop I was a little surprised how quickly you did respin the patches. In general if I have feedback that requires a major rework of a non-trivial series, I do not want to touch it instantly. Let the discussion continue a bit, because it can easily turn into another direction and create more work. I think waiting a few days before doing anything that involves a lot of work generally helps to make everyone's life a little easier. > (TLDR: git branch plz) Yes, for any non-trivial series that really, really helps. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing 2021-03-09 0:44 ` Dave Chinner 2021-03-09 4:35 ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Darrick J. Wong @ 2021-03-10 14:49 ` Brian Foster 1 sibling, 0 replies; 33+ messages in thread From: Brian Foster @ 2021-03-10 14:49 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Tue, Mar 09, 2021 at 11:44:10AM +1100, Dave Chinner wrote: > On Fri, Mar 05, 2021 at 09:58:26AM -0500, Brian Foster wrote: > > On Fri, Mar 05, 2021 at 09:48:48AM +1100, Dave Chinner wrote: > > > > > > So not only does this patch expose subsystem internals to > > > > > > external layers and create more log forcing interfaces/behaviors to > > > > > > > > > > Sorry, I don't follow. What "subsystem internals" are being exposed > > > > > and what external layer are they being exposed to? > > > > > > > > > > > > Cleaning up the mess that is the xfs_log_* and xlog_* interfaces and > > > > > > > changing things like log force behaviour and implementation is for > > > > > > > a future series. > > > > > > > > > > > > > > > > > > > TBH I think this patch is kind of a mess on its own. I think the > > > > > > mechanism it wants to provide is sane, but I've not even got to the > > > > > > point of reviewing _that_ yet because of the seeming dismissal of higher > > > > > > level feedback. I'd rather not go around in circles on this so I'll just > > > > > > offer my summarized feedback to this patch... > > > > > > > > > > I'm not dismissing review nor am I saying the API cannot or should > > > > > not be improved. I'm simply telling you that I think the additional > > > > > changes you are proposing are outside the scope of the problem I am > > > > > addressing here. I already plan to rework the log force API (and > > > > > others) but doing so is not something that this patchset needs to > > > > > address, or indeed should address. > > > > > > > > > > > > > I'm not proposing additional changes nor to rework the log force API. 
> > > > I'm pointing out that I find this implementation to be extremely and > > > > unnecessarily confusing. > > > > > > And that's just fine - not everyone has to understand all of the > > > code all of the time. That's why we have a team.... > > > > > > > To improve it, I'm suggesting to either co-opt > > > > the existing async log force API... > > > > > > And I'm telling you that this is *future work*. As the person > > > actually doing the work, that's my decision and you need to accept > > > that. I understand your concerns and have a plan to address them - > > > you just have to accept that the plan to address them isn't going to > > > be exactly to your liking. > > > > > > > As a reviewer, I am pointing out that I object to how this patch is > > implemented and offered several fairly simple suggestions on how to > > address my concerns. Those suggestions have been rejected pretty much > > out of hand on the basis of the existence of some future plans. > > We're allowed to disagree on the best approach. But to find a way > forward means that both the submitter and the reviewer need to > compromise. I've fixed up the obvious warts and bugs you've pointed > out, and agreed that it needs cleaning up and have committed to > cleaning it up in the near future. > > > Future rework plans do not justify or change my view of this patch > > and the mess it adds (for other developers, for historical context > > and downstreams, etc.). Therefore, if the only option you're > > willing to consider is "make this mess now and clean it up later > > in some broader rework," then I'd rather we just defer this patch > > until after that rework is available and avoid the mess that way. > > So you won't review it until I have 100 outstanding patches in this > series and it's completely and utterly unreviewable? 
Then you'll ask > me to split it up and re-order it into digestible chunks, and so > we'll go back to having to merge this because all of the API rework > depends on the mechanism that this patch introduces. > (Well, I'm not reviewing it now because I'm on PTO this week...) > I'm not going to play that "now jump through this hoop" game. We > add flags for on-off behaviours in internal functions -all the > time-. If this makes the interface so complex and confusing that you > don't understand it, then the interface was already too complex and > confusing. And fixing that is *not in the scope of this patchset*. > I'm pretty sure creating a flag for the one-off behavior here was one of my earlier suggestions. Based on the above, what's the problem with that? I think Darrick's suggestion (re: my previous reply) around a wrapper approach is quite reasonable as well. > Demanding that code be made perfect before it can be merged is > really not very helpful. Especially when there are already plans to > rework the API but that rework is dependent on a bunch of other > changes that need to be done first. > "Jumping through hoops" on 100 patch series? Demanding perfection? That's a little extreme for first pass feedback on v1 of a 3 patch series, don't you think? My position on this patch in its current form is that regardless of any pending rework, it is unnecessarily gross. As stated numerous times, I think there are various options for small tweaks that make it passable for the time being. Let me know if you become willing to entertain the notion that this patch might stand to improve independent from some broader rework and I'm happy to revisit any of those options in further detail. In the meantime, I'm growing a little tired of the hyperbolic replies around playing games, jumping through hoops, demanding perfection, or whatever other nonsense. 
I've invested the better part of a week (and now into PTO) in review of this patch alone because I feel strongly about the result and am willing to contribute toward a solution, not because I want to waste time or play games. I don't feel in kind about the replies. Brian P.S., FWIW, I certainly don't expect all feedback to be incorporated during review. On balance, I'd say a reasonable breakdown is that probably half makes sense, another half might be bogus, and then some percentage of that comes down to preference between reviewer and author where I'll try to defer to the patch author when I feel there's at least been some good faith discussion or justification for the current approach. "No, that's obfuscation" or "no, I'll do it later" are not productive arguments if you're looking for compromise. Those leave no room for discussion. There's no counter suggestion to try and address the immediate concern and work towards something mutually agreeable, or even any opportunity to arrive at the conclusion that the original patch might be the ideal outcome after all. > iclogs are something that need to be moved behind the CIL, not sit > in front of CIL. The CIL abstracts the journal and writing to the > journal completely away from the transaction subsystem, yet the log > force code punches straight through that abstraction and walks > iclogs directly. The whole log force implementation needs to change, > and I plan for the API that wraps the log forces to get reworked at > that time. > > For example, if we want to direct map storage for log writes, then > iclog-based log force synchronisation needs to go away because we > don't need iclogs for buffering journal writes. Hence the log force > code should interface only with the CIL, and only the CIL should > manage whatever mechanism it is using to write to stable storage. > The code is currently the way it is because the CIL, when first > implemented, had to co-exist with the old way of writing to the log. 
> We haven't used that old way for a decade now and we have very > different storage performance characteristics these days, so it's > about time we got rid of the mechanism designed to be optimal for > spinning disks and actually integrated the CIL and the log > efficiently. > > There are a *lot* of steps to do this, and reworking the log force > implementation and API is part of that. But reworking that API is > premature because we haven't put all the necessary pre-work in > place to make such a change yet. This patch is actually part of that > pre-work to put in place the mechanisms that the log force rework will rely > on. > > I have very good reasons for pushing back against your suggestions, > Brian. Your suggestions have merit but this patch is not the time to > be making the changes you suggest. Code does not need to be perfect > to be merged, nor does the entire larger change they will be part of > need to be complete and 100% tested and reviewed before preparation > and infrastructure patches can be merged. This is how we've done big > changes in the past - they've been staged across multiple kernel > cycles - and not everything needs to be done in the first > patchset of a larger body of work.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 33+ messages in thread
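The flag-versus-wrapper debate above reduces to an API-shape question that can be illustrated with a small standalone sketch. All names here are hypothetical, invented for illustration; this is not the actual XFS API. The flagged core function stays internal, and a thin wrapper keeps the flag out of the call sites that never use it:

```c
#include <assert.h>

/* Hypothetical push modes, standing in for the internal behaviour flag. */
enum push_mode { PUSH_SYNC, PUSH_ASYNC };

/*
 * Internal core function takes the behaviour flag.  It returns the mode
 * it ran with, purely so the sketch is testable.
 */
enum push_mode do_push_seq(unsigned long seq, enum push_mode mode)
{
	(void)seq;	/* a real implementation would push the CIL up to seq */
	return mode;
}

/* The wrapper approach: external callers never see the flag at all. */
enum push_mode push_seq(unsigned long seq)
{
	return do_push_seq(seq, PUSH_SYNC);
}

/* An async variant used only by the few callers that want it. */
enum push_mode push_seq_async(unsigned long seq)
{
	return do_push_seq(seq, PUSH_ASYNC);
}
```

The trade-off under discussion is exactly this: one flagged entry point (simple internally, flag visible everywhere) versus two wrappers over a flagged core (flag hidden, one extra layer of indirection).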
* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing 2021-02-24 21:10 ` Dave Chinner 2021-02-24 23:26 ` [PATCH 2/3 v2] " Dave Chinner @ 2021-02-25 13:12 ` Brian Foster 2021-02-25 22:03 ` Dave Chinner 1 sibling, 1 reply; 33+ messages in thread From: Brian Foster @ 2021-02-25 13:12 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote: > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > The AIL pushing is stalling on log forces when it comes across > > pinned items. This is happening on removal workloads where the AIL > > is dominated by stale items that are removed from the AIL when the > > checkpoint that marks the items stale is committed to the journal. > > This results in relatively few items in the AIL, but those that > > remain are often pinned, as the directories that items are being > > removed from are still being logged. > > > > As a result, many push cycles through the CIL will first issue a > > blocking log force to unpin the items. This can take some time to > > complete, with tracing regularly showing push delays of half a > > second and sometimes up into the range of several seconds. Sequences > > like this aren't uncommon: > > > > ....
> > 399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20 > > <wanted 20ms, got 270ms delay> > > 400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0 > > 400.099623: xfsaild: first lsn 0x11002f3600 > > 400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50 > > <wanted 50ms, got 500ms delay> > > 400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0 > > 400.589349: xfsaild: first lsn 0x1100305000 > > 400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50 > > <wanted 50ms, got 460ms delay> > > 400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0 > > 400.950343: xfsaild: first lsn 0x1100317c00 > > 400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20 > > <wanted 20ms, got 200ms delay> > > 401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0 > > 401.142334: xfsaild: first lsn 0x110032e600 > > 401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10 > > <wanted 10ms, got 10ms delay> > > 401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000 > > 401.154328: xfsaild: first lsn 0x1100353000 > > 401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20 > > <wanted 20ms, got 300ms delay> > > 401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 > > 401.451526: xfsaild: first lsn 0x1100353000 > > 401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50 > > <wanted 50ms, got 500ms delay> > > 401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 > > .... > > > > In each of these cases, every AIL pass saw 101 log items stuck on > > the AIL (pinned) with very few other items being found. Each pass, a > > log force was issued, and delay between last/first is the sleep time > > + the sync log force time. 
> > > > Some of these 101 items pinned the tail of the log. The tail of the > > log does slowly creep forward (first lsn), but the problem is that > > the log is actually out of reservation space because it's been > > running so many transactions that create stale items that never reach > > the AIL but still consume log space. Hence we have a largely empty AIL, with > > long term pins on items that pin the tail of the log that don't get > > pushed frequently enough to keep log space available. > > > > The problem is the hundreds of milliseconds that we block in the log > > force pushing the CIL out to disk. The AIL should not be stalled > > like this - it needs to run and flush items that are at the tail of > > the log with minimal latency. What we really need to do is trigger a > > log flush, but then not wait for it at all - we've already done our > > waiting for stuff to complete when we backed off prior to the log > > force being issued. > > > > Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we > > still do a blocking flush of the CIL and that is what is causing the > > issue. Hence we need a new interface for the CIL to trigger an > > immediate background push of the CIL to get it moving faster but not > > to wait on that to occur. While the CIL is pushing, the AIL can also > > be pushing. > > > > We already have an internal interface to do this - > > xlog_cil_push_now() - but we need a wrapper for it to be used > > externally. xlog_cil_force_seq() can easily be extended to do what > > we need as it already implements the synchronous CIL push via > > xlog_cil_push_now(). Add the necessary flags and "push current > > sequence" semantics to xlog_cil_force_seq() and convert the AIL > > pushing to use it. > > Overnight testing indicates generic/530 hangs on small logs with > this patch. It essentially runs out of log space because there are > inode cluster buffers permanently pinned by this workload.
> > That is, as inodes are unlinked, they repeatedly relog the inode > cluster buffer, and so while the CIL is flushing to unpin the > buffer, a new unlink relogs it and adds a new pin to the buffer. > Hence the log force that the AIL does to unpin pinned items doesn't > actually unpin buffers that are being relogged. > > These buffers only get unpinned when the log actually runs out of > space because they pin the tail of the log. Then the modifications > that are relogging the buffer stop because they fail to get a log > reservation, and the tail of the log does not move forward until the > AIL issues a log force that finally unpins the buffer and writes it > back. > > Essentially, relogged buffers cannot be flushed by the AIL until the > log completely stalls. > > The problem being tripped over here is that we no longer force the > final iclogs in a CIL push to disk - we leave the iclog with the > commit record in it in ACTIVE state, and by not waiting and forcing > all the iclogs to disk, the buffer never gets unpinned because there > isn't any more log pressure to force it out because everything is > blocked on reservation space. > > The solution to this is to have the CIL push change the state of the > commit iclog to WANT_SYNC before it is released. This means the CIL > push will always flush the iclog to disk and the checkpoint will > complete and unpin the buffers. > > Right now, we really only want to do this state switch for these > async pushes - for small sync transactions and fsync we really want > the iclog aggregation that we have now to optimise iclogbuf usage, > so I'll have to pass a new flag through the push code and back into > xlog_write(). That will make this patch behave the same as we > currently do. > Unfortunately I've not yet caught up to reviewing your most recently posted set of log patches so I can easily be missing some context, but when passing through some of the feedback/updates so far this has me rather confused. 
We discussed this pre-existing CIL behavior in the previous version, I suggested some similar potential behavior change where we would opportunistically send off checkpoint iclogs for I/O a bit earlier than normal and you argued [1] against it. Now it sounds like not only are we implementing that, but it's actually necessary to fix a log hang problem..? What am I missing? The updated iclog behavior does sound more friendly to me than what we currently do (obviously, based on my previous comments), but I am slightly skeptical of how such a change fixes the root cause of a hang. Is this a stall/perf issue or an actual log deadlock? If the latter, what prevents this deadlock on current upstream? Brian [1] https://lore.kernel.org/linux-xfs/20210129222559.GT4662@dread.disaster.area/ > In the longer term, we need to change how the AIL deals with pinned > buffers, as the current method is definitely sub-optimal. It also > explains the "everything stops for a second" performance anomalies > I've been seeing for a while in testing. But fixing that is outside > the scope of this patchset, so in the mean time I'll fix this one up > and repost it in a little while. > > FWIW, this is also the likely cause of the "CIL workqueue too deep" > hangs I was seeing with the next patch in the series, too. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 33+ messages in thread
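The fix Dave describes above, switching the commit record's iclog to WANT_SYNC only when the push was triggered asynchronously, reduces to one small decision that can be modelled in isolation. This is a toy standalone model with simplified state names, not the kernel's actual iclog state machine (which has more states such as SYNCING and DONE_SYNC):

```c
#include <assert.h>

/* Simplified iclog states; the kernel tracks several more. */
enum iclog_state { ICLOG_ACTIVE, ICLOG_WANT_SYNC };

/*
 * Toy model of releasing the commit-record iclog at the end of a CIL
 * push.  An async-triggered push must guarantee the checkpoint reaches
 * disk on its own, since no caller is waiting to force iclogs later, so
 * it marks the iclog WANT_SYNC for immediate submission.  A normal push
 * leaves it ACTIVE so later sync transactions/fsyncs can aggregate into
 * it and fill it up.
 */
enum iclog_state release_commit_iclog(int async_push)
{
	return async_push ? ICLOG_WANT_SYNC : ICLOG_ACTIVE;
}
```

The point of the conditional is preserving the existing iclog aggregation behaviour for the common sync-transaction/fsync path while still guaranteeing checkpoint completion for the rare async forces.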
* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing 2021-02-25 13:12 ` [PATCH 2/3] " Brian Foster @ 2021-02-25 22:03 ` Dave Chinner 2021-02-27 16:25 ` Brian Foster 0 siblings, 1 reply; 33+ messages in thread From: Dave Chinner @ 2021-02-25 22:03 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote: > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote: > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote: > > The problem being tripped over here is that we no longer force the > > final iclogs in a CIL push to disk - we leave the iclog with the > > commit record in it in ACTIVE state, and by not waiting and forcing > > all the iclogs to disk, the buffer never gets unpinned because there > > isn't any more log pressure to force it out because everything is > > blocked on reservation space. > > > > The solution to this is to have the CIL push change the state of the > > commit iclog to WANT_SYNC before it is released. This means the CIL > > push will always flush the iclog to disk and the checkpoint will > > complete and unpin the buffers. > > > > Right now, we really only want to do this state switch for these > > async pushes - for small sync transactions and fsync we really want > > the iclog aggregation that we have now to optimise iclogbuf usage, > > so I'll have to pass a new flag through the push code and back into > > xlog_write(). That will make this patch behave the same as we > > currently do. > > > > Unfortunately I've not yet caught up to reviewing your most recently > posted set of log patches so I can easily be missing some context, but > when passing through some of the feedback/updates so far this has me > rather confused. 
We discussed this pre-existing CIL behavior in the > previous version, I suggested some similar potential behavior change > where we would opportunistically send off checkpoint iclogs for I/O a > bit earlier than normal and you argued [1] against it. Now it sounds > like not only are we implementing that, but it's actually necessary to > fix a log hang problem..? What am I missing? You're talking about this comment? | > That means there's an increased chance that | > the next iclog in the ring may be active. Perhaps we could introduce | > some logic to switch out the commit record iclog before the CIL push | > returns in those particular cases. For example, switch out if the | > current checkpoint hit multiple iclogs..? Or hit multiple iclogs and the | > next iclog in the ring is clean..? | | We could just call xlog_state_switch_iclogs() to mark it WANT_SYNC, | but then we can't aggregate more changes into it and fill it up. If | someone starts waiting on that iclog (i.e. a log force), then it | immediately gets marked WANT_SYNC and submitted to disk when it is | released. But if there is no-one waiting on it, then we largely | don't care if an asynchronous checkpoint is committed immediately, | at the start of the next checkpoint, or at worst, within 30s when | the log worker next kicks a log force.... I'm arguing against the general case here of submitting iclog buffers with checkpoints in them early. There is no reason to do that because the xfs_log_force() code will kick the iclog to disk if necessary. That still stands - I have not changed behaviour unconditionally. This patch only kicks the commit_iclog to disk if the cause of the CIL push was an async call to xlog_cil_force_seq(). That can occur from xfs_iunpin() and, with this patch, the AIL. In both cases, all they care about is that the CIL checkpoint fully completes in the near future.
For xfs_iunpin(), the log force doesn't guarantee the inode is actually unpinned, hence it waits on the pin count and not the log force. So waiting for the CIL push is unnecessary as long as the CIL push gets the entire checkpoint to disk. Similarly, the AIL only needs the log force to guarantee the CIL commits to disk in the near future, but it does not need to block waiting for that to occur. So in these cases, we don't want to block while the CIL is pushing just so the log force code can iterate the last iclog the CIL wrote to flush it to disk. Instead, we pass a flag to the CIL to tell it "you need to push the commit record to disk" so we don't have to wait for it to then check if it pushed the commit record or not. This is a special case, not the general case where we are running lots of fsync or sync transactions. In the general case, we are trying to aggregate as many sync transactions/fsync log forces as possible into the current iclog to batch them up effectively. We'll continue to batch them while there is space in the current iclog and the previous iclog is in the WANT_SYNC/SYNCING state indicating it is still being written to disk. This automated batching behaviour of the iclog state machine is one of the reasons XFS has always had excellent performance under sync transaction/fsync heavy workloads such as NFS/CIFS servers. It is functionality we need to preserve, and this patch does not actually change that behaviour. > The updated iclog behavior does sound more friendly to me than what we > currently do (obviously, based on my previous comments), As per above, the conditional behaviour I've added here is actually less friendly :) > but I am > slightly skeptical of how such a change fixes the root cause of a hang.
The problem does not exist until we have code that depends on an async CIL push guaranteeing that items in the CIL are unpinned. A stock xfs_log_force_seq(0) call guarantees that the log reaches disk by waiting on the CIL push then pushing the iclogs. All it does is elide waiting for the final iclogs to hit the disk. By eliding the "wait for CIL push" in an async xfs_log_force_seq() call, we also elide the "push iclogs" part of the log force and that's what causes the AIL problem. We don't push the iclogs, so the items committed to the last iclog don't get dispatched and the items never get unpinned because the checkpoint never completes. So this "AIL can hang because tail items aren't unpinned" condition is exposed by async CIL forces. It is also fixed by this patch passing the async push state to the CIL push and having it ensure that the commit iclog is written out if the CIL push was triggered by an async push. Like all log forces, async pushes are rare, but we have to ensure they behave correctly. There is no bug in the code before this patch, and there is no bug in the code (that I know of) after this patch. I've described a condition that we have to handle in this patch that, if we don't, we end up with broken AIL pushing. As to the "root cause of the hang" I now know what that is too, but addressing it is outside the scope of this patchset. It's taken me a long time to identify this, but I've been seeing symptoms of it for quite a long time (i.e. years) but until doing this async CIL push work I've never been able to identify the cause. That is, the root cause is that the AIL cannot force a pinned buffer to disk, and a log force (of any kind) does not guarantee a pinned item is unpinned. That's because the CIL is pipelined, meaning that an item can be pinned in one CIL context, and while that is being pushed, the front end can relog that same item and pin it in the current CIL context. Hence it has two pins, not one.
Now when the original log force completes, the pin from the first CIL context has been removed, but the pin from the current CIL context is still present. Hence when we are relogging buffers over time (e.g. repeatedly modifying a directory), the AIL always sees the buffer pinned and cannot flush it. Even if the buffer has been moved to the delwri list (i.e. had a successful push) we still can't write it out because the delwri code skips pinned buffers. This iterative pin continues preventing the AIL from flushing the buffer until it pins the tail of the log. When the log runs out of space and blocks the modifying/relogging task(s), the log force the AIL issues allows the CIL to be pushed without the front end relogging the item and re-pinning the item. At the completion of the log force, the AIL can now push the unpinned buffer and hence move the tail of the log forward which frees up log reservation space. This then wakes the front end, which goes back to its relogging game and the cycle repeats over. This is really nasty behaviour, and it's only recently that I've got a handle on it. I found it because my original "async CIL push" code resulted in long stalls every time the log is filled and the tail is pinned by a buffer that is being relogged in this manner.... I'm not sure how to fix this yet - the AIL needs to block the front end relogging to allow the buffer to be unpinned. Essentially, we need to hold the pinned items locked across a CIL push to guarantee they are unpinned, but that's the complete opposite of what the AIL currently does to prevent the front end from seeing long tail lock latencies when modifying stuff.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
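The double-pin scenario Dave describes above can be modelled with a trivial pin counter. This is a toy standalone model, not kernel code: an item relogged into the next CIL context while the previous checkpoint is still in flight carries two pins, so completing the first checkpoint still leaves it pinned and unflushable by the AIL:

```c
#include <assert.h>

/* Toy model of a log item's pin count across overlapping CIL contexts. */
struct log_item {
	int pin_count;
};

/* Committing the item into a CIL context takes a pin on it. */
void cil_pin(struct log_item *lip)
{
	lip->pin_count++;
}

/* Completion of that context's checkpoint I/O drops its pin. */
void checkpoint_complete(struct log_item *lip)
{
	lip->pin_count--;
}

/* The AIL (and the delwri submission code) skip pinned items. */
int ail_can_flush(const struct log_item *lip)
{
	return lip->pin_count == 0;
}
```

Walking the model through the cycle: pin in context A, relog into context B while A's push is in flight (pin count 2), then complete A's checkpoint - the count drops to 1, so the AIL still cannot flush the item even though the log force it issued has completed.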
* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing 2021-02-25 22:03 ` Dave Chinner @ 2021-02-27 16:25 ` Brian Foster 2021-03-01 4:54 ` Dave Chinner 0 siblings, 1 reply; 33+ messages in thread From: Brian Foster @ 2021-02-27 16:25 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote: > On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote: > > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote: > > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote: > > > The problem being tripped over here is that we no longer force the > > > final iclogs in a CIL push to disk - we leave the iclog with the > > > commit record in it in ACTIVE state, and by not waiting and forcing > > > all the iclogs to disk, the buffer never gets unpinned because there > > > isn't any more log pressure to force it out because everything is > > > blocked on reservation space. > > > > > > The solution to this is to have the CIL push change the state of the > > > commit iclog to WANT_SYNC before it is released. This means the CIL > > > push will always flush the iclog to disk and the checkpoint will > > > complete and unpin the buffers. > > > > > > Right now, we really only want to do this state switch for these > > > async pushes - for small sync transactions and fsync we really want > > > the iclog aggregation that we have now to optimise iclogbuf usage, > > > so I'll have to pass a new flag through the push code and back into > > > xlog_write(). That will make this patch behave the same as we > > > currently do. > > > > > > > Unfortunately I've not yet caught up to reviewing your most recently > > posted set of log patches so I can easily be missing some context, but > > when passing through some of the feedback/updates so far this has me > > rather confused. 
We discussed this pre-existing CIL behavior in the > > previous version, I suggested some similar potential behavior change > > where we would opportunistically send off checkpoint iclogs for I/O a > > bit earlier than normal and you argued [1] against it. Now it sounds > > like not only are we implementing that, but it's actually necessary to > > fix a log hang problem..? What am I missing? > > You're talking about this comment? > Yes. > | > That means there's an increased chance that > | > the next iclog in the ring may be active. Perhaps we could introduce > | > some logic to switch out the commit record iclog before the CIL push > | > returns in those particular cases. For example, switch out if the > | > current checkpoint hit multiple iclogs..? Or hit multiple iclogs and the > | > next iclog in the ring is clean..? > | > | We could just call xlog_state_switch_iclogs() to mark it WANT_SYNC, > | but then we can't aggregate more changes into it and fill it up. If > | someone starts waiting on that iclog (i.e. a log force), then it > | immediately gets marked WANT_SYNC and submitted to disk when it is > | released. But if there is no-one waiting on it, then we largely > | don't care if an asynchronous checkpoint is committed immediately, > | at the start of the next checkpoint, or at worst, within 30s when > | the log worker next kicks a log force.... > > I'm arguing against the general case here of submitting iclog > buffers with checkpoints in them early. There is no reason to do that > because the xfs_log_force() code will kick the iclog to disk if > necessary. That still stands - I have not changed behaviour > unconditionally. > But the suggestion wasn't to change behavior unconditionally. That seems fairly clear to me even from just the quoted snippet above..? > This patch only kicks the commit_iclog to disk if the cause of the > CIL push was an async call to xlog_cil_force_seq(). That can occur > from xfs_iunpin() and, with this patch, the AIL.
> > In both cases, all they care about is that the CIL checkpoint fully > completes in the near future. For xfs_iunpin(), the log force > doesn't guarantee the inode is actually unpinned, hence it waits on > the pin count and not the log force. So waiting for the CIL push is > unnecessary as long as the CIL push gets the entire checkpoint to > disk. > > Similarly, the AIL only needs the log force to guarantee the CIL > commits to disk in the near future, but it does not need to block > waiting for that to occur. > > So in these cases, we don't want to block while the CIL is pushing > just so the log force code can iterate the last iclog the CIL wrote > to flush it to disk. Instead, we pass a flag to the CIL to tell it > "you need to push the commit record to disk" so we don't have to > wait for it to then check if it pushed the commit record or not. > This is a special case, not the general case where we are running > lots of fsync or sync transactions. > > In the general case, we are trying to aggregate as many sync > transactions/fsync log forces as possible into the current iclog to > batch them up effectively. We'll continue to batch them while there > is space in the current iclog and the previous iclog is in the > WANT_SYNC/SYNCING state indicating it is still being written to > disk. > > This automated batching behaviour of the iclog state machine is one > of the reasons XFS has always had excellent performance under > sync transaction/fsync heavy workloads such as NFS/CIFS servers. It > is functionality we need to preserve, and this patch does not > actually change that behaviour. > Sure, this all makes sense. > > The updated iclog behavior does sound more friendly to me than what we > > currently do (obviously, based on my previous comments), > > As per above, the conditional behaviour I've added here is actually > less friendly :) > Heh. :P > > but I am > > slightly skeptical of how such a change fixes the root cause of a hang.
> > Is this a stall/perf issue or an actual log deadlock? If the latter, > > what prevents this deadlock on current upstream? > > The problem that I talk about is introduced by this patch and also > fixed by this patch. The problem does not exist until we have code > that depends on an async CIL push guaranteeing that items in the CIL > are unpinned. A stock xfs_log_force_seq(0) call guarantees that the > log reaches disk by waiting on the CIL push then pushing the iclogs. > All it does is elide waiting for the final iclogs to hit the disk. > > By eliding the "wait for CIL push" in an async xfs_log_force_seq() > call, we also elide the "push iclogs" part of the log force and > that's what causes the AIL problem. We don't push the iclogs, so the > items committed to the last iclog don't get dispatched and the items > never get unpinned because the checkpoint never completes. > So in general it sounds like this patch broke async log force by skipping the current iclog switch somehow or another and then this change is incorporated to fix it (hence there is no outstanding bug upstream). That makes a bit more sense. I'll get the detailed context from the patches. > So this "AIL can hang because tail items aren't unpinned" condition > is exposed by async CIL forces. It is also fixed by this patch > passing the async push state to the CIL push and having it ensure > that the commit iclog is written out if the CIL push was triggered > by an async push. > > Like all log forces, async pushes are rare, but we have to ensure > they behave correctly. There is no bug in the code before this > patch, and there is no bug in the code (that I know of) after this > patch. I've described a condition that we have to handle in this > patch that, if we don't, we end up with broken AIL pushing. > Ok. > As to the "root cause of the hang" I now know what that is too, but > addressing it is outside the scope of this patchset.
It's taken me a > long time to identify this, but I've been seeing symptoms of it for > quite a long time (i.e. years) but until doing this async CIL push > work I've never been able to identify the cause. > > That is, the root cause is that the AIL cannot force a pinned buffer > to disk, and a log force (of any kind) does not guarantee a pinned > item is unpinned. That's because the CIL is pipelined, meaning that > an item can be pinned in one CIL context, and while that is being > pushed, the front end can relog that same item and pin it in the > current CIL context. Hence it has two pins, not one. > Somewhat related to the above point, ISTM our current behavior can lead to scenarios where the potential window for elevated pin counts is somewhat artificially widened. IOW, the log vector for a pincount == 1 item might have been written out to disk, but the log item itself not unpinned until some time later because the commit record is still sitting in an active iclog. The unpin processing of the item doesn't occur until that commit record iclog is synced out and I/O completes, which leaves plenty of time for another pin to come along in the meantime (and the scenario potentially repeats, starving out pincount == 0 state). So ISTM the "damage" is done to some degree long before the AIL has any opportunity to fix it, particularly if the only recourse it has is a log force. (And of course there's a huge tradeoff between current behavior and something that unconditionally synced out commit record iclogs, particularly depending on CIL checkpoint sizes, so that's a broad simplification and not a valid solution.) > Hence when we are relogging buffers over time (e.g. repeatedly > modifying a directory), the AIL always sees the buffer pinned and > cannot flush it.
Even if the buffer has been moved to the delwri > list (i.e. had a successful push) we still can't write it out > because the delwri code skips pinned buffers. > > This iterative pin continues preventing the AIL from flushing the > buffer until it pins the tail of the log. When the log runs out of > space and blocks the modifying/relogging task(s), the log force the > AIL issues allows the CIL to be pushed without the front end > relogging the item and re-pinning the item. At the completion of the > log force, the AIL can now push the unpinned buffer and hence move > the tail of the log forward which frees up log reservation space. > This then wakes the front end, which goes back to its relogging > game and the cycle repeats over. > The description of the nested pin behavior makes sense. I've observed the same thing recently when looking at some unrelated shutdown problems. In any event, if I'm following correctly the deadlock issue was a transient problem with this patch and not an upstream problem. However there is still a pipeline stall/performance issue that manifests as described above, and that preexists the proposed changes. Yes? > This is really nasty behaviour, and it's only recently that I've got > a handle on it. I found it because my original "async CIL push" code > resulted in long stalls every time the log is filled and the tail is > pinned by a buffer that is being relogged in this manner.... > > I'm not sure how to fix this yet - the AIL needs to block the front > end relogging to allow the buffer to be unpinned. Essentially, we > need to hold the pinned items locked across a CIL push to guarantee > they are unpinned, but that's the complete opposite of what the AIL > currently does to prevent the front end from seeing long tail lock > latencies when modifying stuff.... > When this stall problem manifests, I'm assuming it's exacerbated by delayed logging and the commit record behavior I described above.
If that's the case, could the AIL communicate writeback pressure through affected log items such that checkpoints in which they are resident are flushed out completely/immediately when the checkpoints occur? I suppose that would require a log item flag or some such, which does raise a concern of unnecessarily tagging many items (it's not clear to me how likely that really is), but I'm curious if that would be an effective POC at least.. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 33+ messages in thread
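Brian's POC idea above - tag the log items the AIL is stalled on so that a CIL push containing them flushes the checkpoint out completely - might look something like the following sketch. The flag name and every helper here are invented for illustration; nothing like this exists in the XFS tree:

```c
#include <assert.h>

#define LI_AIL_PRESSURE	(1u << 0)	/* hypothetical log item flag */

struct item {
	unsigned int flags;
};

/* The AIL tags an item it keeps finding pinned across push cycles. */
void ail_mark_pressure(struct item *ip)
{
	ip->flags |= LI_AIL_PRESSURE;
}

/*
 * At CIL push time, if any item resident in the checkpoint carries AIL
 * pressure, the push would sync the commit record iclog immediately
 * instead of leaving it ACTIVE for aggregation.
 */
int push_should_sync_commit(const struct item *items, int nitems)
{
	for (int i = 0; i < nitems; i++) {
		if (items[i].flags & LI_AIL_PRESSURE)
			return 1;
	}
	return 0;
}
```

The concern Brian raises maps directly onto this shape: if relogging pressure tags a large fraction of items, the check degenerates to always syncing the commit record, which is the unconditional behaviour both sides agree is not a valid solution.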
* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing 2021-02-27 16:25 ` Brian Foster @ 2021-03-01 4:54 ` Dave Chinner 2021-03-01 13:32 ` Brian Foster 0 siblings, 1 reply; 33+ messages in thread From: Dave Chinner @ 2021-03-01 4:54 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote: > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote: > > On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote: > > > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote: > > > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote: > > The problem that I talk about is introduced by this patch and also > > fixed by this patch. The problem does not exist until we have code > > that depends on an async CIL push guaranteeing that items in the CIL > > are unpinned. A stock xfs_log_force_seq(0) call guarantees that the > > log reaches disk by waiting on the CIL push then pushing the iclogs. > > All it does is elide waiting for the final iclogs to hit the disk. > > > > By eliding the "wait for CIL push" in an async xfs_log_force_seq() > > call, we also elide the "push iclogs" part of the log force and > > that's what causes the AIL problem. We don't push the iclogs, so the > > items committed to the last iclog don't get dispatched and the items > > never get unpinned because the checkpoint never completes. > > > > So in general it sounds like this patch broke async log force by > skipping the current iclog switch somehow or another and then this > change is incorporated to fix it (hence there is no outstanding bug > upstream). That makes a bit more sense. I'll get the detailed context > from the patches. Yes. > > As to the "root cause of the hang" I now know what that is too, but > > addressing it is outside the scope of this patchset. It's taken me a > > long time to identify this, but I've been seeing symptoms of it for > > quite a long time (i.e. 
years) but until doing this async CIL push > > work I've never been able to identify the cause. > > > > That is, the root cause is that the AIL cannot force a pinned buffer > > to disk, and a log force (of any kind) does not guarantee a pinned > > item is unpinned. That's because the CIL is pipelined, meaning that > > an item can be pinned in one CIL context, and while that is being > > pushed, the front end can relog that same item and pin it in the > > current CIL context. Hence it has two pins, not one. Now when the > > original log force completes, the pin from the first CIL context has > > been removed, but the pin from the current CIL context is still > > present. > > > > Somewhat related to the above point, ISTM our current behavior can lead > to scenarios where the potential window for elevated pin counts is > somewhat artificially widened. IOW, the log vector for a pincount == 1 > item might have been written out to disk, but the log item itself not > unpinned for some time later because the commit record is still sitting > in an active iclog. The unpin processing of the item doesn't occur until > that commit record iclog is synced out and I/O completes, which leaves > plenty of time for another pin to come along in the meantime (and the > scenario potentially repeats, starving out pincount == 0 state). So ISTM > the "damage" is done to some degree long before the AIL has any > opportunity to fix it, particularly if the only recourse it has is a log > force. > > (And of course there's a huge tradeoff between current behavior and > something that unconditionally synced out commit record iclogs, > particularly depending on CIL checkpoint sizes, so that's a broad > simplification and not a valid solution.) Yes, this can be an issue because when the single CIL work is CPU bound, it is the first iclog write of the checkpoint that pushes the previous commit record to disk. So, like you say, there's a natural pin overlap in this situation. 
> > Hence when we are relogging buffers over time (e.g. repeatedly > > modifying a directory), the AIL always sees the buffer pinned and > > cannot flush it. Even if the buffer has been moved to the delwri > > list (i.e. had a successful push) we still can't write it out > > because the delwri code skips pinned buffers. > > > > This iterative pin continues preventing the AIL from flushing the > > buffer until it pins the tail of the log. When the log runs out of > > space and blocks the modifying/relogging task(s), the log force the > > AIL issues allows the CIL to be pushed without the front end > > relogging the item and re-pinning the item. At the completion of the > > log force, the AIL can now push the unpinned buffer and hence move > > the tail of the log forward which frees up log reservation space. > > This then wakes the front end, which goes back to its relogging > > game and the cycle repeats over. > > > > The description of the nested pin behavior makes sense. I've observed > the same thing recently when looking at some unrelated shutdown > problems. > > In any event, if I'm following correctly the deadlock issue was a > transient problem with this patch and not an upstream problem. However > there is still a pipeline stall/performance issue that manifests as > described above, and that preexists the proposed changes. Yes? Yes, AFAICT, this pin problem dates back to day zero. Before delayed logging it required the inode or buffer to be relogged in almost every iclog that was written (think a long running truncate removing hundreds of thousands of extents).
I found it because my original "async CIL push" code > > resulted in long stalls every time the log is filled and the tail is > > pinned by a buffer that is being relogged in this manner.... > > > > I'm not sure how to fix this yet - the AIL needs to block the front > > end relogging to allow the buffer to be unpinned. Essentially, we > > need to hold the pinned items locked across a CIL push to guarantee > > they are unpinned, but that's the complete opposite of what the AIL > > currently does to prevent the front end from seeing long tail lock > > latencies when modifying stuff.... > > When this stall problem manifests, I'm assuming it's exacerbated by > delayed logging and the commit record behavior I described above. If > that's the case, could the AIL communicate writeback pressure through > affected log items such that checkpoints in which they are resident are > flushed out completely/immediately when the checkpoints occur? I suppose > that would require a log item flag or some such, which does raise a > concern of unnecessarily tagging many items (it's not clear to me how > likely that really is), but I'm curious if that would be an effective > POC at least.. I don't think we need to do anything like that. All we need to do to ensure that the AIL can flush a pinned buffer is to lock it, kick the log and wait for the pin count to go to zero. Then we can write it just fine, blocking only the front end transactions that need that buffer lock. Same goes for inodes, though xfs_iunpin_wait() already does this.... The only issue is determining if this causes unacceptable long tail latencies on buffers like the AG headers.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing 2021-03-01 4:54 ` Dave Chinner @ 2021-03-01 13:32 ` Brian Foster 2021-03-03 1:23 ` Dave Chinner 0 siblings, 1 reply; 33+ messages in thread From: Brian Foster @ 2021-03-01 13:32 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote: > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote: > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote: > > > On Thu, Feb 25, 2021 at 08:12:12AM -0500, Brian Foster wrote: > > > > On Thu, Feb 25, 2021 at 08:10:58AM +1100, Dave Chinner wrote: > > > > > On Tue, Feb 23, 2021 at 04:32:11PM +1100, Dave Chinner wrote: > > > The problem that I talk about is introduced by this patch and also > > > fixed by this patch. The problem does not exist until we have code > > > that depends on an async CIL push guaranteeing that items in the CIL > > > are unpinned. A stock xfs_log_force_seq(0) call guarantees that the > > > log reaches disk by waiting on the CIL push then pushing the iclogs. > > > All it does is elide waiting for the final iclogs to hit the disk. > > > > > > By eliding the "wait for CIL push" in an async xfs_log_force_seq() > > > call, we also elide the "push iclogs" part of the log force and > > > that's what causes the AIL problem. We don't push the iclogs, so the > > > items committed to the last iclog don't get dispatched and the items > > > never get unpinned because the checkpoint never completes. > > > > > > > So in general it sounds like this patch broke async log force by > > skipping the current iclog switch somehow or another and then this > > change is incorporated to fix it (hence there is no outstanding bug > > upstream). That makes a bit more sense. I'll get the detailed context > > from the patches. > > Yes. > > > > As to the "root cause of the hang" I now know what that is too, but > > > addressing it is outside the scope of this patchset. 
It's taken me a > > > long time to identify this, but I've been seeing symptoms of it for > > > quite a long time (i.e. years) but until doing this async CIL push > > > work I've never been able to identify the cause. > > > > > > That is, the root cause is that the AIL cannot force a pinned buffer > > > to disk, and a log force (of any kind) does not guarantee a pinned > > > item is unpinned. That's because the CIL is pipelined, meaning that > > > an item can be pinned in one CIL context, and while that is being > > > pushed, the front end can relog that same item and pin it in the > > > current CIL context. Hence it has two pins, not one. Now when the > > > original log force completes, the pin from the first CIL context has > > > been removed, but the pin from the current CIL context is still > > > present. > > > > > > > Somewhat related to the above point, ISTM our current behavior can lead > > to scenarios where the potential window for elevated pin counts is > > somewhat artificially widened. IOW, the log vector for a pincount == 1 > > item might have been written out to disk, but the log item itself not > > unpinned for some time later because the commit record is still sitting > > in an active iclog. The unpin processing of the item doesn't occur until > > that commit record iclog is synced out and I/O completes, which leaves > > plenty of time for another pin to come along in the meantime (and the > > scenario potentially repeats, starving out pincount == 0 state). So ISTM > > the "damage" is done to some degree long before the AIL has any > > opportunity to fix it, particularly if the only recourse it has is a log > > force. > > > > (And of course there's a huge tradeoff between current behavior and > > something that unconditionally synced out commit record iclogs, > > particularly depending on CIL checkpoint sizes, so that's a broad > > simplification and not a valid solution.) 
> > Yes, this can be an issue because when the single CIL work is CPU > bound, it is the first iclog write of the checkpoint that pushes the > previous commit record to disk. So, like you say, there's a natural > pin overlap in this situation. > > > > Hence when we are relogging buffers over time (e.g. repeatedly > > > modifying a directory), the AIL always sees the buffer pinned and > > > cannot flush it. Even if the buffer has been moved to the delwri > > > list (i.e. had a successful push) we still can't write it out > > > because the delwri code skips pinned buffers. > > > > > > This iterative pin continues preventing the AIL from flushing the > > > buffer until it pins the tail of the log. When the log runs out of > > > space and blocks the modifying/relogging task(s), the log force the > > > AIL issues allows the CIL to be pushed without the front end > > > relogging the item and re-pinning the item. At the completion of the > > > log force, the AIL can now push the unpinned buffer and hence move > > > the tail of the log forward which frees up log reservation space. > > > This then wakes the front end, which goes back to its relogging > > > game and the cycle repeats over. > > > > > > > The description of the nested pin behavior makes sense. I've observed > > the same thing recently when looking at some unrelated shutdown > > problems. > > > > In any event, if I'm following correctly the deadlock issue was a > > transient problem with this patch and not an upstream problem. However > > there is still a pipeline stall/performance issue that manifests as > > described above, and that preexists the proposed changes. Yes? > > Yes, AFAICT, this pin problem dates back to day zero. Before delayed > logging it required the inode or buffer to be relogged in almost > every iclog that was written (think a long running truncate removing > hundreds of thousands of extents).
But before delayed logging, this > was very slow and limited by iclog IO so having the AIL stall would > have largely been unnoticed. > > > > This is really nasty behaviour, and it's only recently that I've got > > > a handle on it. I found it because my original "async CIL push" code > > > resulted in long stalls every time the log is filled and the tail is > > > pinned by a buffer that is being relogged in this manner.... > > > > > > I'm not sure how to fix this yet - the AIL needs to block the front > > > end relogging to allow the buffer to be unpinned. Essentially, we > > > need to hold the pinned items locked across a CIL push to guarantee > > > they are unpinned, but that's the complete opposite of what the AIL > > > currently does to prevent the front end from seeing long tail lock > > > latencies when modifying stuff.... > > > > When this stall problem manifests, I'm assuming it's exacerbated by > > delayed logging and the commit record behavior I described above. If > > that's the case, could the AIL communicate writeback pressure through > > affected log items such that checkpoints in which they are resident are > > flushed out completely/immediately when the checkpoints occur? I suppose > > that would require a log item flag or some such, which does raise a > > concern of unnecessarily tagging many items (it's not clear to me how > > likely that really is), but I'm curious if that would be an effective > > POC at least.. > > I don't think we need to do anything like that. All we need to do to > ensure that the AIL can flush a pinned buffer is to lock it, kick > the log and wait for the pin count to go to zero. Then we can write > it just fine, blocking only the front end transactions that need > that buffer lock. Same goes for inodes, though xfs_iunpin_wait() > already does this.... > Yeah, but why would we want to block xfsaild on a single item like that? Wouldn't holding the item locked like that just create a new stall point within xfsaild? 
Maybe I'm missing something, but what you describe here basically just sounds like a "lock the item and run a sync log force" pattern. I agree that the implementation I handwave about above seems overkill, but I'm really curious what the disadvantage of some form of logic in the logging code to do something like "if this checkpoint contains items that are AIL resident && under some measure of writeback pressure && the next iclog is active, want_sync the commit record iclog on background push to reduce item pin count times" might be. Brian > The only issue is determining if this causes unacceptable long tail > latencies on buffers like the AG headers.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing 2021-03-01 13:32 ` Brian Foster @ 2021-03-03 1:23 ` Dave Chinner 2021-03-03 17:20 ` Brian Foster 0 siblings, 1 reply; 33+ messages in thread From: Dave Chinner @ 2021-03-03 1:23 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Mon, Mar 01, 2021 at 08:32:18AM -0500, Brian Foster wrote: > On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote: > > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote: > > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote: > > > > This is really nasty behaviour, and it's only recently that I've got > > > > a handle on it. I found it because my original "async CIL push" code > > > > resulted in long stalls every time the log is filled and the tail is > > > > pinned by a buffer that is being relogged in this manner.... > > > > > > > > I'm not sure how to fix this yet - the AIL needs to block the front > > > > end relogging to allow the buffer to be unpinned. Essentially, we > > > > need to hold the pinned items locked across a CIL push to guarantee > > > > they are unpinned, but that's the complete opposite of what the AIL > > > > currently does to prevent the front end from seeing long tail lock > > > > latencies when modifying stuff.... > > > > > > When this stall problem manifests, I'm assuming it's exacerbated by > > > delayed logging and the commit record behavior I described above. If > > > that's the case, could the AIL communicate writeback pressure through > > > affected log items such that checkpoints in which they are resident are > > > flushed out completely/immediately when the checkpoints occur? I suppose > > > that would require a log item flag or some such, which does raise a > > > concern of unnecessarily tagging many items (it's not clear to me how > > > likely that really is), but I'm curious if that would be an effective > > > POC at least.. > > > > I don't think we need to do anything like that. 
All we need to do to > > ensure that the AIL can flush a pinned buffer is to lock it, kick > > the log and wait for the pin count to go to zero. Then we can write > > it just fine, blocking only the front end transactions that need > > that buffer lock. Same goes for inodes, though xfs_iunpin_wait() > > already does this.... > > > > Yeah, but why would we want to block xfsaild on a single item like that? Who said anything about blocking the AIL on a single item? :) > Wouldn't holding the item locked like that just create a new stall point > within xfsaild? Maybe I'm missing something, but what you describe here > basically just sounds like a "lock the item and run a sync log force" > pattern. What I was thinking is that if the item is pinned and at the tail of the log, then we leave it locked when we flush it rather than unlocking it and relocking it when the delwri submit code gets to it. If it gets unpinned before the delwri code gets to it, all good. If not, the delwri code being unable to flush it will feed back up into the AIL to trigger a log force, which will unpin it in the near future and it will be written on the next AIL delwri submit cycle. The thing we need to be careful of here is minimising the lock hold time by the AIL while we unpin the item. That's the "long tail latencies" problem I mention above. Essentially, we need to try to avoid holding the item locked for a long period of time before issuing the log force and/or resubmitting it for IO once it is unlocked. I have a few ideas on this, but no patches yet. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing 2021-03-03 1:23 ` Dave Chinner @ 2021-03-03 17:20 ` Brian Foster 2021-03-04 2:01 ` Dave Chinner 0 siblings, 1 reply; 33+ messages in thread From: Brian Foster @ 2021-03-03 17:20 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Wed, Mar 03, 2021 at 12:23:04PM +1100, Dave Chinner wrote: > On Mon, Mar 01, 2021 at 08:32:18AM -0500, Brian Foster wrote: > > On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote: > > > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote: > > > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote: > > > > > This is really nasty behaviour, and it's only recently that I've got > > > > > a handle on it. I found it because my original "async CIL push" code > > > > > resulted in long stalls every time the log is filled and the tail is > > > > > pinned by a buffer that is being relogged in this manner.... > > > > > > > > > > I'm not sure how to fix this yet - the AIL needs to block the front > > > > > end relogging to allow the buffer to be unpinned. Essentially, we > > > > > need to hold the pinned items locked across a CIL push to guarantee > > > > > they are unpinned, but that's the complete opposite of what the AIL > > > > > currently does to prevent the front end from seeing long tail lock > > > > > latencies when modifying stuff.... > > > > > > > > When this stall problem manifests, I'm assuming it's exacerbated by > > > > delayed logging and the commit record behavior I described above. If > > > > that's the case, could the AIL communicate writeback pressure through > > > > affected log items such that checkpoints in which they are resident are > > > > flushed out completely/immediately when the checkpoints occur? 
I suppose > > > > that would require a log item flag or some such, which does raise a > > > > concern of unnecessarily tagging many items (it's not clear to me how > > > > likely that really is), but I'm curious if that would be an effective > > > > POC at least.. > > > > > > I don't think we need to do anything like that. All we need to do to > > > ensure that the AIL can flush a pinned buffer is to lock it, kick > > > the log and wait for the pin count to go to zero. Then we can write > > > it just fine, blocking only the front end transactions that need > > > that buffer lock. Same goes for inodes, though xfs_iunpin_wait() > > > already does this.... > > > > > > > Yeah, but why would we want to block xfsaild on a single item like that? > > Who said anything about blocking the AIL on a single item? :) > > > Wouldn't holding the item locked like that just create a new stall point > > within xfsaild? Maybe I'm missing something, but what you describe here > > basically just sounds like a "lock the item and run a sync log force" > > pattern. > > What I was thinking is that if the item is pinned and at the > tail of the log, then we leave it locked when we flush it rather > than unlocking it and relocking it when the delwri submit code gets > to it. If it gets unpinned before the delwri code gets to it, all > good. If not, the delwri code being unable to flush it will feed > back up into the AIL to trigger a log force, which will unpin it > in the near future and it will be written on the next AIL delwri > submit cycle. > I'm not sure what you mean by leaving the item locked when we flush it if it is pinned, since we don't flush pinned items. Perhaps implied is that this would trylock the buffer first, then only worry about if it's pinned if we acquire the lock. If so (and the pin is at the tail), hold the lock and kick the log as a means to ensure that xfsaild is guaranteed next access to the buffer once unpinned. Hm? IIUC, that seems interesting. 
Though as noted in the flush optimization series, I am a little leery about issuing (even async) log forces with locks held, at least with the current implementation.. Brian > The thing we need to be careful of this is minimising the lock hold > time by the AIL while we unpin the item. That's the "long tail > latencies" problem I mention above. Essentially, we need to try to > avoid holding the item locked for a long period of time before > issuing the log force and/or resubmitting it for IO once it is > unlocked. I have a few ideas on this, but no patches yet. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing 2021-03-03 17:20 ` Brian Foster @ 2021-03-04 2:01 ` Dave Chinner 0 siblings, 0 replies; 33+ messages in thread From: Dave Chinner @ 2021-03-04 2:01 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Wed, Mar 03, 2021 at 12:20:19PM -0500, Brian Foster wrote: > On Wed, Mar 03, 2021 at 12:23:04PM +1100, Dave Chinner wrote: > > On Mon, Mar 01, 2021 at 08:32:18AM -0500, Brian Foster wrote: > > > On Mon, Mar 01, 2021 at 03:54:22PM +1100, Dave Chinner wrote: > > > > On Sat, Feb 27, 2021 at 11:25:06AM -0500, Brian Foster wrote: > > > > > On Fri, Feb 26, 2021 at 09:03:05AM +1100, Dave Chinner wrote: > > > > > > This is really nasty behaviour, and it's only recently that I've got > > > > > > a handle on it. I found it because my original "async CIL push" code > > > > > > resulted in long stalls every time the log is filled and the tail is > > > > > > pinned by a buffer that is being relogged in this manner.... > > > > > > > > > > > > I'm not sure how to fix this yet - the AIL needs to block the front > > > > > > end relogging to allow the buffer to be unpinned. Essentially, we > > > > > > need to hold the pinned items locked across a CIL push to guarantee > > > > > > they are unpinned, but that's the complete opposite of what the AIL > > > > > > currently does to prevent the front end from seeing long tail lock > > > > > > latencies when modifying stuff.... > > > > > > > > > > When this stall problem manifests, I'm assuming it's exacerbated by > > > > > delayed logging and the commit record behavior I described above. If > > > > > that's the case, could the AIL communicate writeback pressure through > > > > > affected log items such that checkpoints in which they are resident are > > > > > flushed out completely/immediately when the checkpoints occur? 
I suppose > > > > > that would require a log item flag or some such, which does raise a > > > > > concern of unnecessarily tagging many items (it's not clear to me how > > > > > likely that really is), but I'm curious if that would be an effective > > > > > POC at least.. > > > > > > > > I don't think we need to do anything like that. All we need to do to > > > > ensure that the AIL can flush a pinned buffer is to lock it, kick > > > > the log and wait for the pin count to go to zero. Then we can write > > > > it just fine, blocking only the front end transactions that need > > > > that buffer lock. Same goes for inodes, though xfs_iunpin_wait() > > > > already does this.... > > > > > > > > > > Yeah, but why would we want to block xfsaild on a single item like that? > > > > Who said anything about blocking the AIL on a single item? :) > > > > > Wouldn't holding the item locked like that just create a new stall point > > > within xfsaild? Maybe I'm missing something, but what you describe here > > > basically just sounds like a "lock the item and run a sync log force" > > > pattern. > > > > What I was thinking is that if the item is pinned and at the > > tail of the log, then we leave it locked when we flush it rather > > than unlocking it and relocking it when the delwri submit code gets > > to it. If it gets unpinned before the delwri code gets to it, all > > good. If not, the delwri code being unable to flush it will feed > > back up into the AIL to trigger a log force, which will unpin it > > in the near future and it will be written on the next AIL delwri > > submit cycle. > > > > I'm not sure what you mean by leaving the item locked when we flush it > if it is pinned, since we don't flush pinned items. That's exactly what I'm talking about changing. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH 3/3] xfs: CIL work is serialised, not pipelined 2021-02-23 5:32 [PATCH 0/3] xfs: CIL improvements Dave Chinner 2021-02-23 5:32 ` [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner 2021-02-23 5:32 ` [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing Dave Chinner @ 2021-02-23 5:32 ` Dave Chinner 2 siblings, 0 replies; 33+ messages in thread From: Dave Chinner @ 2021-02-23 5:32 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> Because we use a single work structure attached to the CIL rather than the CIL context, we can only queue a single work item at a time. This results in the CIL being single threaded and limits performance when it becomes CPU bound. The design of the CIL is that it is pipelined and multiple commits can be running concurrently, but the way the work is currently implemented means that it is not pipelining as it was intended. The critical work to switch the CIL context can take a few milliseconds to run, but the rest of the CIL context flush can take hundreds of milliseconds to complete. The context switching is the serialisation point of the CIL; once the context has been switched, the rest of the context push can run asynchronously with all other context pushes. Hence we can move the work to the CIL context so that we can run multiple CIL pushes at the same time and spread the majority of the work out over multiple CPUs. We can keep the per-cpu CIL commit state on the CIL rather than the context, because the context is pinned to the CIL until the switch is done and we aggregate and drain the per-cpu state held on the CIL during the context switch. However, because we no longer serialise the CIL work, we can have effectively unlimited CIL pushes in progress. We don't want to do this - not only does it create contention on the iclogs and the state machine locks, we can run the log right out of space with outstanding pushes.
Instead, limit the work concurrency to 4 concurrent works being processed at a time. This is enough concurrency to remove the CIL from being a CPU-bound bottleneck but not enough to create new contention points or unbounded concurrency issues. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_log_cil.c | 80 +++++++++++++++++++++++-------------------- fs/xfs/xfs_log_priv.h | 2 +- fs/xfs/xfs_super.c | 2 +- 3 files changed, 44 insertions(+), 40 deletions(-) diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c index be6c91d7218f..54d3fabd9a5b 100644 --- a/fs/xfs/xfs_log_cil.c +++ b/fs/xfs/xfs_log_cil.c @@ -47,6 +47,34 @@ xlog_cil_ticket_alloc( return tic; } +/* + * Unavoidable forward declaration - xlog_cil_push_work() calls + * xlog_cil_ctx_alloc() itself. + */ +static void xlog_cil_push_work(struct work_struct *work); + +static struct xfs_cil_ctx * +xlog_cil_ctx_alloc(void) +{ + struct xfs_cil_ctx *ctx; + + ctx = kmem_zalloc(sizeof(*ctx), KM_NOFS); + INIT_LIST_HEAD(&ctx->committing); + INIT_LIST_HEAD(&ctx->busy_extents); + INIT_WORK(&ctx->push_work, xlog_cil_push_work); + return ctx; +} + +static void +xlog_cil_ctx_switch( + struct xfs_cil *cil, + struct xfs_cil_ctx *ctx) +{ + ctx->sequence = ++cil->xc_current_sequence; + ctx->cil = cil; + cil->xc_ctx = ctx; +} + /* * After the first stage of log recovery is done, we know where the head and * tail of the log are.
We need this log initialisation done before we can
@@ -641,11 +669,11 @@ static void
 xlog_cil_push_work(
 	struct work_struct	*work)
 {
-	struct xfs_cil		*cil =
-		container_of(work, struct xfs_cil, xc_push_work);
+	struct xfs_cil_ctx	*ctx =
+		container_of(work, struct xfs_cil_ctx, push_work);
+	struct xfs_cil		*cil = ctx->cil;
 	struct xlog		*log = cil->xc_log;
 	struct xfs_log_vec	*lv;
-	struct xfs_cil_ctx	*ctx;
 	struct xfs_cil_ctx	*new_ctx;
 	struct xlog_in_core	*commit_iclog;
 	struct xlog_ticket	*tic;
@@ -658,11 +686,10 @@ xlog_cil_push_work(
 	xfs_lsn_t		push_seq;
 	DECLARE_COMPLETION_ONSTACK(bdev_flush);

-	new_ctx = kmem_zalloc(sizeof(*new_ctx), KM_NOFS);
+	new_ctx = xlog_cil_ctx_alloc();
 	new_ctx->ticket = xlog_cil_ticket_alloc(log);

 	down_write(&cil->xc_ctx_lock);
-	ctx = cil->xc_ctx;

 	spin_lock(&cil->xc_push_lock);
 	push_seq = cil->xc_push_seq;
@@ -692,7 +719,7 @@ xlog_cil_push_work(

 	/* check for a previously pushed sequence */
-	if (push_seq < cil->xc_ctx->sequence) {
+	if (push_seq < ctx->sequence) {
 		spin_unlock(&cil->xc_push_lock);
 		goto out_skip;
 	}
@@ -762,19 +789,7 @@ xlog_cil_push_work(
 	}

 	/*
-	 * initialise the new context and attach it to the CIL. Then attach
-	 * the current context to the CIL committing list so it can be found
-	 * during log forces to extract the commit lsn of the sequence that
-	 * needs to be forced.
-	 */
-	INIT_LIST_HEAD(&new_ctx->committing);
-	INIT_LIST_HEAD(&new_ctx->busy_extents);
-	new_ctx->sequence = ctx->sequence + 1;
-	new_ctx->cil = cil;
-	cil->xc_ctx = new_ctx;
-
-	/*
-	 * The switch is now done, so we can drop the context lock and move out
+	 * Switch the contexts so we can drop the context lock and move out
 	 * of a shared context. We can't just go straight to the commit record,
 	 * though - we need to synchronise with previous and future commits so
 	 * that the commit records are correctly ordered in the log to ensure
@@ -799,7 +814,7 @@ xlog_cil_push_work(
 	 * deferencing a freed context pointer.
 	 */
 	spin_lock(&cil->xc_push_lock);
-	cil->xc_current_sequence = new_ctx->sequence;
+	xlog_cil_ctx_switch(cil, new_ctx);
 	spin_unlock(&cil->xc_push_lock);
 	up_write(&cil->xc_ctx_lock);
@@ -957,7 +972,7 @@ xlog_cil_push_background(
 	spin_lock(&cil->xc_push_lock);
 	if (cil->xc_push_seq < cil->xc_current_sequence) {
 		cil->xc_push_seq = cil->xc_current_sequence;
-		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
+		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
 	}

 	/*
@@ -1011,7 +1026,7 @@ xlog_cil_push_now(

 	/* start on any pending background push to minimise wait time on it */
 	if (flush)
-		flush_work(&cil->xc_push_work);
+		flush_workqueue(log->l_mp->m_cil_workqueue);

 	/*
 	 * If the CIL is empty or we've already pushed the sequence then
@@ -1024,7 +1039,7 @@ xlog_cil_push_now(
 	}

 	cil->xc_push_seq = push_seq;
-	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
+	queue_work(log->l_mp->m_cil_workqueue, &cil->xc_ctx->push_work);
 	spin_unlock(&cil->xc_push_lock);
 }
@@ -1259,13 +1274,6 @@ xlog_cil_init(
 	if (!cil)
 		return -ENOMEM;

-	ctx = kmem_zalloc(sizeof(*ctx), KM_MAYFAIL);
-	if (!ctx) {
-		kmem_free(cil);
-		return -ENOMEM;
-	}
-
-	INIT_WORK(&cil->xc_push_work, xlog_cil_push_work);
 	INIT_LIST_HEAD(&cil->xc_cil);
 	INIT_LIST_HEAD(&cil->xc_committing);
 	spin_lock_init(&cil->xc_cil_lock);
@@ -1273,16 +1281,12 @@ xlog_cil_init(
 	init_waitqueue_head(&cil->xc_push_wait);
 	init_rwsem(&cil->xc_ctx_lock);
 	init_waitqueue_head(&cil->xc_commit_wait);
-
-	INIT_LIST_HEAD(&ctx->committing);
-	INIT_LIST_HEAD(&ctx->busy_extents);
-	ctx->sequence = 1;
-	ctx->cil = cil;
-	cil->xc_ctx = ctx;
-	cil->xc_current_sequence = ctx->sequence;
-
 	cil->xc_log = log;
 	log->l_cilp = cil;
+
+	ctx = xlog_cil_ctx_alloc();
+	xlog_cil_ctx_switch(cil, ctx);
+
 	return 0;
 }

diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index d9929d9146e4..c0058c1f0c81 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -245,6 +245,7 @@ struct xfs_cil_ctx {
 	struct list_head	iclog_entry;
 	struct list_head	committing;	/* ctx committing list */
 	struct work_struct	discard_endio_work;
+	struct work_struct	push_work;
 };

 /*
@@ -276,7 +277,6 @@ struct xfs_cil {
 	struct list_head	xc_committing;
 	wait_queue_head_t	xc_commit_wait;
 	uint64_t		xc_current_sequence;
-	struct work_struct	xc_push_work;
 	wait_queue_head_t	xc_push_wait;	/* background push throttle */
 } ____cacheline_aligned_in_smp;

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 85dd9593b40b..343499fdcbf3 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -502,7 +502,7 @@ xfs_init_mount_workqueues(
 	mp->m_cil_workqueue = alloc_workqueue("xfs-cil/%s",
 			XFS_WQFLAGS(WQ_FREEZABLE | WQ_MEM_RECLAIM | WQ_UNBOUND),
-			0, mp->m_super->s_id);
+			4, mp->m_super->s_id);
 	if (!mp->m_cil_workqueue)
 		goto out_destroy_unwritten;
--
2.28.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread
end of thread, other threads:[~2021-03-11 12:43 UTC | newest]

Thread overview: 33+ messages
2021-02-23  5:32 [PATCH 0/3] xfs: CIL improvements Dave Chinner
2021-02-23  5:32 ` [PATCH 1/3] xfs: xfs_log_force_lsn isn't passed a LSN Dave Chinner
2021-02-24 21:42   ` Darrick J. Wong
2021-02-24 22:19     ` Dave Chinner
2021-02-25 19:01   ` Christoph Hellwig
2021-03-02 18:12   ` Brian Foster
2021-02-23  5:32 ` [PATCH 2/3] xfs: AIL needs asynchronous CIL forcing Dave Chinner
2021-02-24 21:10   ` Dave Chinner
2021-02-24 23:26     ` [PATCH 2/3 v2] " Dave Chinner
2021-02-25  2:15       ` Dave Chinner
2021-03-02 21:44       ` Brian Foster
2021-03-03  0:57         ` Dave Chinner
2021-03-03 17:32           ` Brian Foster
2021-03-04  1:59             ` Dave Chinner
2021-03-04 13:13               ` Brian Foster
2021-03-04 22:48                 ` Dave Chinner
2021-03-05 14:58                   ` Brian Foster
2021-03-09  0:44                     ` Dave Chinner
2021-03-09  4:35       ` [PATCH 2/3 v2] xfs: AIL needs asynchronous CIL forcing Darrick J. Wong
2021-03-10  2:10         ` Brian Foster
2021-03-10 22:00           ` Dave Chinner
2021-03-10 15:13         ` Brian Foster
2021-03-11 12:41           ` Christoph Hellwig
2021-03-10 14:49       ` Brian Foster
2021-02-25 13:12   ` [PATCH 2/3] " Brian Foster
2021-02-25 22:03     ` Dave Chinner
2021-02-27 16:25       ` Brian Foster
2021-03-01  4:54         ` Dave Chinner
2021-03-01 13:32           ` Brian Foster
2021-03-03  1:23             ` Dave Chinner
2021-03-03 17:20               ` Brian Foster
2021-03-04  2:01                 ` Dave Chinner
2021-02-23  5:32 ` [PATCH 3/3] xfs: CIL work is serialised, not pipelined Dave Chinner