linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Allison Henderson <allison.henderson@oracle.com>
To: Dave Chinner <david@fromorbit.com>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 6/7] [RFC] xfs: intent item whiteouts
Date: Fri, 3 Sep 2021 14:09:58 -0700	[thread overview]
Message-ID: <67e48ab2-bf23-cff1-7793-6e6b4fa1dd70@oracle.com> (raw)
In-Reply-To: <20210902095927.911100-7-david@fromorbit.com>



On 9/2/21 2:59 AM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we log modifications based on intents, we add both intent
> and intent done items to the modification being made. These get
> written to the log to ensure that the operation is re-run if the
> intent done is not found in the log.
> 
> However, for operations that complete wholly within a single
> checkpoint, the change in the checkpoint is atomic and will never
> need replay. In this case, we don't need to actually write the
> intent and intent done items to the journal because log recovery
> will never need to manually restart this modification.
> 
> Log recovery currently handles intent/intent done matching by
> inserting the intent into the AIL, then removing it when a matching
> intent done item is found. Hence for all the intent-based operations
> that complete within a checkpoint, we spend all that time parsing
> the intent/intent done items just to cancel them and do nothing with
> them.
> 
> Hence it follows that the only time we actually need intents in the
> log is when the modification crosses checkpoint boundaries in the
> log and so may only be partially complete in the journal. Hence if
> we commit and intent done item to the CIL and the intent item is in
> the same checkpoint, we don't actually have to write them to the
> journal because log recovery will always cancel the intents.
> 
> We've never really worried about the overhead of logging intents
> unnecessarily like this because the intents we log are generally
> very much smaller than the change being made. e.g. freeing an extent
> involves modifying at lease two freespace btree blocks and the AGF,
> so the EFI/EFD overhead is only a small increase in space and
> processing time compared to the overall cost of freeing an extent.
> 
> However, delayed attributes change this cost equation dramatically,
> especially for inline attributes. In the case of adding an inline
> attribute, we only log the inode core and attribute fork at present.
> With delayed attributes, we now log the attr intent which includes
> the name and value, the inode core adn attr fork, and finally the
> attr intent done item. We increase the number of items we log from 1
> to 3, and the number of log vectors (regions) goes up from 3 to 7.
> Hence we tripple the number of objects that the CIL has to process,
> and more than double the number of log vectors that need to be
> written to the journal.
> 
> At scale, this means delayed attributes cause a non-pipelined CIL to
> become CPU bound processing all the extra items, resulting in a > 40%
> performance degradation on 16-way file+xattr create worklaods.
> Pipelining the CIL (as per 5.15) reduces the performance degradation
> to 20%, but now the limitation is the rate at which the log items
> can be written to the iclogs and iclogs be dispatched for IO and
> completed.
> 
> Even log IO completion is slowed down by these intents, because it
> now has to process 3x the number of items in the checkpoint.
> Processing completed intents is especially inefficient here, because
> we first insert the intent into the AIL, then remove it from the AIL
> when the intent done is processed. IOWs, we are also doing expensive
> operations in log IO completion we could completely avoid if we
> didn't log completed intent/intent done pairs.
> 
> Enter log item whiteouts.
> 
> When an intent done is committed, we can check to see if the
> associated intent is in the same checkpoint as we are currently
> committing the intent done to. If so, we can mark the intent log
> item with a whiteout and immediately free the intent done item
> rather than committing it to the CIL. We can basically skip the
> entire formatting and CIL insertion steps for the intent done item.
> 
> However, we cannot remove the intent item from the CIL at this point
> because the unlocked per-cpu CIL item lists do not permit removal
> without holding the CIL context lock exclusively. Transaction commit
> only holds the context lock shared, hence the best we can do is mark
> the intent item with a whiteout so that the CIL push can release it
> rather than writing it to the log.
> 
> This means we never write the intent to the log if the intent done
> has also been committed to the same checkpoint, but we'll always
> write the intent if the intent done has not been committed or has
> been committed to a different checkpoint. This will result in
> correct log recovery behaviour in all cases, without the overhead of
> logging unnecessary intents.
> 
> This intent whiteout concept is generic - we can apply it to all
> intent/intent done pairs that have a direct 1:1 relationship. The
> way deferred ops iterate and relog intents mean that all intents
> currently have a 1:1 relationship with their done intent, and hence
> we can apply this cancellation to all existing intent/intent done
> implementations.
> 
> For delayed attributes with a 16-way 64kB xattr create workload,
> whiteouts reduce the amount of journalled metadata from ~2.5GB/s
> down to ~600MB/s and improve the creation rate from 9000/s to
> 14000/s.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Ok, this makes a lot of sense.  Thanks for the detailed explanation, it 
helps a lot.  I think we need to do everything we can to optimize things 
as much as possible.  Since the over all goal is parent pointers, attr 
activity will increase quite a bit.  So I would support this change.

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_log_cil.c | 78 +++++++++++++++++++++++++++++++++++++++++---
>   fs/xfs/xfs_trace.h   |  3 ++
>   fs/xfs/xfs_trans.h   |  7 ++--
>   3 files changed, 82 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index bd2c8178255e..fff68aad254e 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -502,7 +502,8 @@ xlog_cil_insert_format_items(
>   static void
>   xlog_cil_insert_items(
>   	struct xlog		*log,
> -	struct xfs_trans	*tp)
> +	struct xfs_trans	*tp,
> +	uint32_t		released_space)
>   {
>   	struct xfs_cil		*cil = log->l_cilp;
>   	struct xfs_cil_ctx	*ctx = cil->xc_ctx;
> @@ -569,6 +570,7 @@ xlog_cil_insert_items(
>   	 */
>   	cilpcp = get_cpu_ptr(cil->xc_pcp);
>   	cilpcp->space_reserved += ctx_res;
> +	cilpcp->space_used -= released_space;
>   	cilpcp->space_used += len;
>   	if (space_used >= XLOG_CIL_SPACE_LIMIT(log) ||
>   	    cilpcp->space_used >
> @@ -1028,11 +1030,14 @@ xlog_cil_order_cmp(
>   }
>   
>   /*
> - * Build a log vector chain from the current CIL.
> + * Build a log vector chain from the current CIL. If a log item is marked with a
> + * whiteout, we do not need to write it to the journal and so we just move them
> + * to the whiteout list for the caller to dispose of appropriately.
>    */
>   static void
>   xlog_cil_build_lv_chain(
>   	struct xfs_cil_ctx	*ctx,
> +	struct list_head	*whiteouts,
>   	uint32_t		*num_iovecs,
>   	uint32_t		*num_bytes)
>   {
> @@ -1043,6 +1048,11 @@ xlog_cil_build_lv_chain(
>   
>   		item = list_first_entry(&ctx->log_items,
>   					struct xfs_log_item, li_cil);
> +		if (test_bit(XFS_LI_WHITEOUT, &item->li_flags)) {
> +			list_move(&item->li_cil, whiteouts);
> +			trace_xfs_cil_whiteout_skip(item);
> +			continue;
> +		}
>   
>   		lv = item->li_lv;
>   		lv->lv_order_id = item->li_order_id;
> @@ -1092,6 +1102,7 @@ xlog_cil_push_work(
>   	DECLARE_COMPLETION_ONSTACK(bdev_flush);
>   	bool			push_commit_stable;
>   	struct xlog_ticket	*ticket;
> +	LIST_HEAD		(whiteouts);
>   
>   	new_ctx = xlog_cil_ctx_alloc();
>   	new_ctx->ticket = xlog_cil_ticket_alloc(log);
> @@ -1178,7 +1189,7 @@ xlog_cil_push_work(
>   				&bdev_flush);
>   
>   	xlog_cil_pcp_aggregate(cil, ctx);
> -	xlog_cil_build_lv_chain(ctx, &num_iovecs, &num_bytes);
> +	xlog_cil_build_lv_chain(ctx, &whiteouts, &num_iovecs, &num_bytes);
>   
>   	/*
>   	 * Switch the contexts so we can drop the context lock and move out
> @@ -1312,6 +1323,15 @@ xlog_cil_push_work(
>   	/* Not safe to reference ctx now! */
>   
>   	spin_unlock(&log->l_icloglock);
> +
> +	/* clean up log items that had whiteouts */
> +	while (!list_empty(&whiteouts)) {
> +		struct xfs_log_item *item = list_first_entry(&whiteouts,
> +						struct xfs_log_item, li_cil);
> +		list_del_init(&item->li_cil);
> +		trace_xfs_cil_whiteout_unpin(item);
> +		item->li_ops->iop_unpin(item, 1);
> +	}
>   	xfs_log_ticket_ungrant(log, ticket);
>   	return;
>   
> @@ -1323,6 +1343,14 @@ xlog_cil_push_work(
>   
>   out_abort_free_ticket:
>   	ASSERT(xlog_is_shutdown(log));
> +	while (!list_empty(&whiteouts)) {
> +		struct xfs_log_item *item = list_first_entry(&whiteouts,
> +						struct xfs_log_item, li_cil);
> +		list_del_init(&item->li_cil);
> +		trace_xfs_cil_whiteout_unpin(item);
> +		item->li_ops->iop_unpin(item, 1);
> +	}
> +
>   	if (!ctx->commit_iclog) {
>   		xfs_log_ticket_ungrant(log, ctx->ticket);
>   		xlog_cil_committed(ctx);
> @@ -1475,6 +1503,43 @@ xlog_cil_empty(
>   	return empty;
>   }
>   
> +/*
> + * If there are intent done items in this transaction and the related intent was
> + * committed in the current (same) CIL checkpoint, we don't need to write either
> + * the intent or intent done item to the journal as the change will be
> + * journalled atomically within this checkpoint. As we cannot remove items from
> + * the CIL here, mark the related intent with a whiteout so that the CIL push
> + * can remove it rather than writing it to the journal. Then remove the intent
> + * done item from the current transaction and release it so it doesn't get put
> + * into the CIL at all.
> + */
> +static uint32_t
> +xlog_cil_process_intents(
> +	struct xfs_cil		*cil,
> +	struct xfs_trans	*tp)
> +{
> +	struct xfs_log_item	*lip, *ilip, *next;
> +	uint32_t		len = 0;
> +
> +	list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
> +		if (!(lip->li_ops->flags & XFS_ITEM_INTENT_DONE))
> +			continue;
> +
> +		ilip = lip->li_ops->iop_intent(lip);
> +		if (!ilip || !xlog_item_in_current_chkpt(cil, ilip))
> +			continue;
> +		set_bit(XFS_LI_WHITEOUT, &ilip->li_flags);
> +		trace_xfs_cil_whiteout_mark(ilip);
> +		len += ilip->li_lv->lv_bytes;
> +		kmem_free(ilip->li_lv);
> +		ilip->li_lv = NULL;
> +
> +		xfs_trans_del_item(lip);
> +		lip->li_ops->iop_release(lip);
> +	}
> +	return len;
> +}
> +
>   /*
>    * Commit a transaction with the given vector to the Committed Item List.
>    *
> @@ -1497,6 +1562,7 @@ xlog_cil_commit(
>   {
>   	struct xfs_cil		*cil = log->l_cilp;
>   	struct xfs_log_item	*lip, *next;
> +	uint32_t		released_space = 0;
>   
>   	/*
>   	 * Do all necessary memory allocation before we lock the CIL.
> @@ -1508,7 +1574,11 @@ xlog_cil_commit(
>   	/* lock out background commit */
>   	down_read(&cil->xc_ctx_lock);
>   
> -	xlog_cil_insert_items(log, tp);
> +	if (tp->t_flags & XFS_TRANS_HAS_INTENT_DONE)
> +		released_space = xlog_cil_process_intents(cil, tp);
> +
> +	xlog_cil_insert_items(log, tp, released_space);
> +	tp->t_ticket->t_curr_res += released_space;
>   
>   	if (regrant && !xlog_is_shutdown(log))
>   		xfs_log_ticket_regrant(log, tp->t_ticket);
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 77a78b5b1a29..d4f5a1410879 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1348,6 +1348,9 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
>   DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_mark);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_skip);
> +DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_unpin);
>   
>   DECLARE_EVENT_CLASS(xfs_ail_class,
>   	TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index a6d7b3309bd7..5765ca6e2c84 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -55,13 +55,16 @@ struct xfs_log_item {
>   #define	XFS_LI_IN_AIL	0
>   #define	XFS_LI_ABORTED	1
>   #define	XFS_LI_FAILED	2
> -#define	XFS_LI_DIRTY	3	/* log item dirty in transaction */
> +#define	XFS_LI_DIRTY	3
> +#define	XFS_LI_WHITEOUT	4
>   
>   #define XFS_LI_FLAGS \
>   	{ (1 << XFS_LI_IN_AIL),		"IN_AIL" }, \
>   	{ (1 << XFS_LI_ABORTED),	"ABORTED" }, \
>   	{ (1 << XFS_LI_FAILED),		"FAILED" }, \
> -	{ (1 << XFS_LI_DIRTY),		"DIRTY" }
> +	{ (1 << XFS_LI_DIRTY),		"DIRTY" }, \
> +	{ (1 << XFS_LI_WHITEOUT),	"WHITEOUT" }
> +
>   
>   struct xfs_item_ops {
>   	unsigned flags;
> 

  reply	other threads:[~2021-09-03 21:10 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-02  9:59 [RFC PATCH 0/7] " Dave Chinner
2021-09-02  9:59 ` [PATCH 1/7] xfs: add log item flags to indicate intents Dave Chinner
2021-09-03 21:08   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 2/7] xfs: tag transactions that contain intent done items Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 3/7] xfs: factor a move some code in xfs_log_cil.c Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 4/7] xfs: add log item method to return related intents Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 5/7] xfs: whiteouts release intents that are not in the AIL Dave Chinner
2021-09-03 21:09   ` Allison Henderson
2021-09-02  9:59 ` [PATCH 6/7] [RFC] xfs: intent item whiteouts Dave Chinner
2021-09-03 21:09   ` Allison Henderson [this message]
2021-09-02  9:59 ` [PATCH 7/7] xfs: reduce kvmalloc overhead for CIL shadow buffers Dave Chinner
2021-09-03 21:55   ` Allison Henderson
2021-09-09 11:37 ` [RFC PATCH 0/7] xfs: intent item whiteouts Christoph Hellwig
2021-09-09 21:21   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=67e48ab2-bf23-cff1-7793-6e6b4fa1dd70@oracle.com \
    --to=allison.henderson@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    --subject='Re: [PATCH 6/7] [RFC] xfs: intent item whiteouts' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).