All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 2/4] mkfs: rework AG header initialisation ordering
Date: Thu, 6 Sep 2018 09:31:24 -0400	[thread overview]
Message-ID: <20180906133123.GB3311@bfoster> (raw)
In-Reply-To: <20180905081932.27478-3-david@fromorbit.com>

On Wed, Sep 05, 2018 at 06:19:30PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When observing the behaviour of an 8EB mkfs execution, I noticed
> that a phase where there are a massive number of read/modify/write
> cycles occurring. I didn't wait for it to complete - it was obvious
> that it was after all the AG headers had been written. That left the
> AGFL initialisation as the likely cause.
> 
> When all the AG headers don't fit in the libxfs buffer cache, the
> AGFL init requires re-reading the AGF, the AGFL, the free space tree
> root blocks and the rmap tree root block. They all then get
> modified and written back out. 10 IOs per AG. When you have 8
> million AGs, that's a lot of extra IO.
> 
> Change the initialisation algorithm to initialise the AGFL
> immediately after initialising the rest of the headers and
> calculating the minimum AGFL size for that AG. This means the
> modifications will all hit the buffer cache and this will remove the
> IO penalty.
> 
> The "worst_freelist" size calculation doesn't change from AG to AG -
> it's based on the physical configuration of the AG, and all AGs have
> the same configuration. hence we only need to calculate this once,
> not for every AG. That allows us to initialise the AGFL immediately
> after the rest of the AG has been initialised rather than in a
> separate pass.
> 
> TIme to make a filesystem from scratch, using a zeroed device so the
> force overwrite algorithms are not triggered and -K to avoid
> discards:
> 
> FS size		10PB	100PB	 1EB
> current mkfs	26.9s	214.8s	2484s
> patched		11.3s	 70.3s	 709s
> 
> In both cases, the IO profile looks identical for the initial AG
> header writeout loop. The difference is that the old code then
> does the RMW loop to init the AGFL, and that runs at about half the
> speed. Hence runtime of the new code is reduce by around 65-70%
> simply by avoiding all that IO.
> 
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

This one seems like it stands alone as a nice fixup. Were you planning
to send this independently as a non-rfc patch?

A couple nits...

>  mkfs/xfs_mkfs.c | 40 +++++++++++++++++++++++++---------------
>  1 file changed, 25 insertions(+), 15 deletions(-)
> 
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index c153592c705e..d70fbdb6b15a 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -3374,7 +3374,7 @@ initialise_ag_headers(
>  	struct xfs_mount	*mp,
>  	struct xfs_sb		*sbp,
>  	xfs_agnumber_t		agno,
> -	int			*worst_freelist)
> +	int			*freelist_size)
>  {
>  	struct xfs_perag	*pag = libxfs_perag_get(mp, agno);
>  	struct xfs_agfl		*agfl;
> @@ -3453,8 +3453,22 @@ initialise_ag_headers(
>  		agf->agf_longest = cpu_to_be32(agsize -
>  			XFS_FSB_TO_AGBNO(mp, cfg->logstart) - cfg->logblocks);
>  	}
> -	if (libxfs_alloc_min_freelist(mp, pag) > *worst_freelist)
> -		*worst_freelist = libxfs_alloc_min_freelist(mp, pag);
> +
> +	/*
> +	 * The AGFL size is the same for all AGs because all AGs have the same
> +	 * layout. If this AG sameness ever changes in the future, we'll need to
> +	 * revisit how we initialise the AGFLs.
> +	 */

This is not necessarily the case if the last AG is not full size, right?
I think the comment could point that out (and/or that this works so long
as we don't process the last AG first).

BTW, libxfs_alloc_min_freelist() uses the ->pagf_levels values for the
bno, cnt and rmap btrees to establish the freelist size, and I don't see
where we've assigned ->pagf_levels[XFS_BTNUM_RMAPi] anywhere.

Brian

> +	if (*freelist_size == 0)
> +		*freelist_size = libxfs_alloc_min_freelist(mp, pag);
> +	else if (*freelist_size < libxfs_alloc_min_freelist(mp, pag)) {
> +		fprintf(stderr,
> +_("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
> +			progname, libxfs_alloc_min_freelist(mp, pag),
> +			agno, *freelist_size);
> +		exit(1);
> +	}
> +
>  	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
>  
>  	/*
> @@ -3724,14 +3738,14 @@ static void
>  initialise_ag_freespace(
>  	struct xfs_mount	*mp,
>  	xfs_agnumber_t		agno,
> -	int			worst_freelist)
> +	int			freelist_size)
>  {
>  	struct xfs_alloc_arg	args;
>  	struct xfs_trans	*tp;
>  	struct xfs_trans_res tres = {0};
>  	int			c;
>  
> -	c = libxfs_trans_alloc(mp, &tres, worst_freelist, 0, 0, &tp);
> +	c = libxfs_trans_alloc(mp, &tres, freelist_size, 0, 0, &tp);
>  	if (c)
>  		res_failed(c);
>  
> @@ -3797,7 +3811,7 @@ main(
>  	int			quiet = 0;
>  	char			*protofile = NULL;
>  	char			*protostring = NULL;
> -	int			worst_freelist = 0;
> +	int			freelist_size = 0;
>  
>  	struct libxfs_xinit	xi = {
>  		.isdirect = LIBXFS_DIRECT,
> @@ -4025,16 +4039,12 @@ main(
>  	}
>  
>  	/*
> -	 * Initialise all the static on disk metadata.
> +	 * Initialise all the AG headers on disk.
>  	 */
> -	for (agno = 0; agno < cfg.agcount; agno++)
> -		initialise_ag_headers(&cfg, mp, sbp, agno, &worst_freelist);
> -
> -	/*
> -	 * Initialise the freespace freelists (i.e. AGFLs) in each AG.
> -	 */
> -	for (agno = 0; agno < cfg.agcount; agno++)
> -		initialise_ag_freespace(mp, agno, worst_freelist);
> +	for (agno = 0; agno < cfg.agcount; agno++) {
> +		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size);
> +		initialise_ag_freespace(mp, agno, freelist_size);
> +	}
>  
>  	/*
>  	 * Allocate the root inode and anything else in the proto file.
> -- 
> 2.17.0
> 

  reply	other threads:[~2018-09-06 18:06 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-05  8:19 [RFCRAP PATCH 0/4 v2] mkfs.xfs IO scalability Dave Chinner
2018-09-05  8:19 ` [PATCH 1/4] mkfs: stop zeroing old superblocks excessively Dave Chinner
2018-09-06 13:31   ` Brian Foster
2018-09-07  0:04     ` Dave Chinner
2018-09-07 11:05       ` Brian Foster
2018-09-05  8:19 ` [PATCH 2/4] mkfs: rework AG header initialisation ordering Dave Chinner
2018-09-06 13:31   ` Brian Foster [this message]
2018-09-07  0:08     ` Dave Chinner
2018-09-05  8:19 ` [PATCH 3/4] mkfs: introduce new delayed write buffer list Dave Chinner
2018-09-06 13:32   ` Brian Foster
2018-09-07  0:21     ` Dave Chinner
2018-09-05  8:19 ` [PATCH 4/4] mkfs: Use AIO for batched writeback Dave Chinner
2018-09-06 13:32   ` Brian Foster
2018-09-07  0:30     ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180906133123.GB3311@bfoster \
    --to=bfoster@redhat.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.