* [RFCRAP PATCH 0/4 v2] mkfs.xfs IO scalability
@ 2018-09-05  8:19 Dave Chinner
  2018-09-05  8:19 ` [PATCH 1/4] mkfs: stop zeroing old superblocks excessively Dave Chinner
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Dave Chinner @ 2018-09-05  8:19 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

More on getting mkfs to be usable for testing unrealistically large
filesystems. The first two patches of this series are unchanged from
yesterday - the second two are new and build on them.

The second two patches hack a delayed write buffer submission list
into mkfs and libxfs. It's a bit nasty, because I've chosen to
ignore the fact that libxfs has no concept of async IO or
background writeback and instead hacked around it. You can see the
result in passing a buffer list to xfs_trans_commit() to get it to
add buffers to the delwri list rather than writing them synchronously.

Fast, loose and stupidly dangerous, all in one. Yeehaw!

Better yet, it doesn't even make any difference to
performance - it's just an enabling patch.

The last patch is the performance improvement - it hacks a grotty,
non-re-entrant AIO submission/completion ring to turn the single
threaded sync write batching into a single threaded concurrent IO
loop using AIO. This can drive really deep IO queues as long as it's
got enough queued IO to work with, so mkfs is hacked to only submit
IO every few hundred AGs it initialises.

This sustains queue depths of around 100 IOs and SSD utilisation at
around 80% using about half a CPU, and so the time to make an 8EB
filesystem drops to around 15 minutes.

This is most definitely not production code. This is a load of crap
hacked together in a few hours as a proof of concept. But it's a
successful proof of concept, so what we now need is someone who is
looking around for a substantial project to volunteer to rewrite the
libxfs buffer cache around an AIO submission/completion core and
implement all this in a "proper" fashion. If you're interested, let
me know...

Cheers,

Dave.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/4] mkfs: stop zeroing old superblocks excessively
  2018-09-05  8:19 [RFCRAP PATCH 0/4 v2] mkfs.xfs IO scalability Dave Chinner
@ 2018-09-05  8:19 ` Dave Chinner
  2018-09-06 13:31   ` Brian Foster
  2018-09-05  8:19 ` [PATCH 2/4] mkfs: rework AG header initialisation ordering Dave Chinner
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2018-09-05  8:19 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When making a new filesystem, don't zero superblocks way beyond the
end of the new filesystem. If the old filesystem was an EB scale
filesystem, then this zeroing requires millions of IOs to complete.
We don't want to do this if the new filesystem on the device is only
going to be 100TB. Sure, zeroing old superblocks a good distance
beyond the new size is a good idea, as is zeroing the ones in the
middle and end, but the other 7,999,000 superblocks? Not so much.

Make a sane cut-off decision - zero out to 10x the size of the new
filesystem, then zero the middle AGs in the old filesystem, then
zero the last ones.

The initial zeroing out to 10x the new fs size means that this code
will only ever trigger in rare corner cases outside a testing
environment - there are very few production workloads where a huge
block device is reused immediately and permanently for a much
smaller filesystem. Those that do this (e.g. on thin provisioned
devices) discard the in-use blocks anyway and so the zeroing won't
actually do anything useful.

Time to mkfs a 1TB filesystem on a big device after it held another
larger filesystem:

previous FS size	10PB	100PB	 1EB
old mkfs time		1.95s	8.9s	81.3s
patched			0.95s	1.2s	 1.2s


Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mkfs/xfs_mkfs.c | 64 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 59 insertions(+), 5 deletions(-)

diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 2e53c1e83b6a..c153592c705e 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -1155,14 +1155,15 @@ validate_ag_geometry(
 
 static void
 zero_old_xfs_structures(
-	libxfs_init_t		*xi,
-	xfs_sb_t		*new_sb)
+	struct libxfs_xinit	*xi,
+	struct xfs_sb		*new_sb)
 {
-	void 			*buf;
-	xfs_sb_t 		sb;
+	void			*buf;
+	struct xfs_sb		sb;
 	uint32_t		bsize;
 	int			i;
 	xfs_off_t		off;
+	xfs_off_t		end;
 
 	/*
 	 * We open regular files with O_TRUNC|O_CREAT. Nothing to do here...
@@ -1220,15 +1221,68 @@ zero_old_xfs_structures(
 
 	/*
 	 * block size and basic geometry seems alright, zero the secondaries.
+	 *
+	 * Don't be insane when it comes to overwriting really large filesystems
+	 * as it could take millions of IOs to zero every secondary
+	 * superblock. If we are remaking a huge filesystem, then do the
+	 * zeroing, but if we are replacing it with a small one (typically done
+	 * in test environments, limit the zeroing to:
+	 *
+	 *	- around the range of the new filesystem
+	 *	- the middle of the old filesystem
+	 *	- the end of the old filesystem
+	 *
+	 * Killing the middle and end of the old filesystem will prevent repair
+	 * from finding it with it's fast secondary sb scan algorithm. The slow
+	 * scan algorithm will then confirm the small filesystem geometry by
+	 * brute force scans.
 	 */
 	memset(buf, 0, new_sb->sb_sectsize);
+
+	/* this carefully avoids integer overflows */
+	end = sb.sb_dblocks;
+	if (sb.sb_agcount > 10000 &&
+	    new_sb->sb_dblocks < end / 10)
+		end = new_sb->sb_dblocks * 10;
 	off = 0;
-	for (i = 1; i < sb.sb_agcount; i++)  {
+	for (i = 1; i < sb.sb_agcount && off < end; i++)  {
+		off += sb.sb_agblocks;
+		if (pwrite(xi->dfd, buf, new_sb->sb_sectsize,
+					off << sb.sb_blocklog) == -1)
+			break;
+	}
+
+	if (end == sb.sb_dblocks)
+		return;
+
+	/*
+	 * Trash the middle 1000 AGs of the old fs, which we know has at least
+	 * 10000 AGs at this point. Cast to make sure we are doing 64bit
+	 * multiplies, otherwise off gets truncated to 32 bit. I hate C.
+	 */
+	i = (sb.sb_agcount / 2) - 500;
+	off = (xfs_off_t)sb.sb_agblocks * i;
+	off = (xfs_off_t)sb.sb_agblocks * ((sb.sb_agcount / 2) - 500);
+	end = off + 1000 * sb.sb_agblocks;
+	while (off < end) {
+		if (pwrite(xi->dfd, buf, new_sb->sb_sectsize,
+					off << sb.sb_blocklog) == -1)
+			break;
 		off += sb.sb_agblocks;
+	}
+
+	/*
+	 * Trash the last 1000 AGs of the old fs
+	 */
+	off = (xfs_off_t)sb.sb_agblocks * (sb.sb_agcount - 1000);
+	end = sb.sb_dblocks;
+	while (off < end) {
 		if (pwrite(xi->dfd, buf, new_sb->sb_sectsize,
 					off << sb.sb_blocklog) == -1)
 			break;
+		off += sb.sb_agblocks;
 	}
+
 done:
 	free(buf);
 }
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/4] mkfs: rework AG header initialisation ordering
  2018-09-05  8:19 [RFCRAP PATCH 0/4 v2] mkfs.xfs IO scalability Dave Chinner
  2018-09-05  8:19 ` [PATCH 1/4] mkfs: stop zeroing old superblocks excessively Dave Chinner
@ 2018-09-05  8:19 ` Dave Chinner
  2018-09-06 13:31   ` Brian Foster
  2018-09-05  8:19 ` [PATCH 3/4] mkfs: introduce new delayed write buffer list Dave Chinner
  2018-09-05  8:19 ` [PATCH 4/4] mkfs: Use AIO for batched writeback Dave Chinner
  3 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2018-09-05  8:19 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When observing the behaviour of an 8EB mkfs execution, I noticed
a phase where a massive number of read/modify/write cycles were
occurring. I didn't wait for it to complete - it was obvious
that it was after all the AG headers had been written. That left the
AGFL initialisation as the likely cause.

When all the AG headers don't fit in the libxfs buffer cache, the
AGFL init requires re-reading the AGF, the AGFL, the free space tree
root blocks and the rmap tree root block. They all then get
modified and written back out. 10 IOs per AG. When you have 8
million AGs, that's a lot of extra IO.

Change the initialisation algorithm to initialise the AGFL
immediately after initialising the rest of the headers and
calculating the minimum AGFL size for that AG. This means the
modifications will all hit the buffer cache and this will remove the
IO penalty.

The "worst_freelist" size calculation doesn't change from AG to AG -
it's based on the physical configuration of the AG, and all AGs have
the same configuration. Hence we only need to calculate this once,
not for every AG. That allows us to initialise the AGFL immediately
after the rest of the AG has been initialised rather than in a
separate pass.

Time to make a filesystem from scratch, using a zeroed device so the
force overwrite algorithms are not triggered and -K to avoid
discards:

FS size		10PB	100PB	 1EB
current mkfs	26.9s	214.8s	2484s
patched		11.3s	 70.3s	 709s

In both cases, the IO profile looks identical for the initial AG
header writeout loop. The difference is that the old code then
does the RMW loop to init the AGFL, and that runs at about half the
speed. Hence runtime of the new code is reduced by around 65-70%
simply by avoiding all that IO.


Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mkfs/xfs_mkfs.c | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index c153592c705e..d70fbdb6b15a 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3374,7 +3374,7 @@ initialise_ag_headers(
 	struct xfs_mount	*mp,
 	struct xfs_sb		*sbp,
 	xfs_agnumber_t		agno,
-	int			*worst_freelist)
+	int			*freelist_size)
 {
 	struct xfs_perag	*pag = libxfs_perag_get(mp, agno);
 	struct xfs_agfl		*agfl;
@@ -3453,8 +3453,22 @@ initialise_ag_headers(
 		agf->agf_longest = cpu_to_be32(agsize -
 			XFS_FSB_TO_AGBNO(mp, cfg->logstart) - cfg->logblocks);
 	}
-	if (libxfs_alloc_min_freelist(mp, pag) > *worst_freelist)
-		*worst_freelist = libxfs_alloc_min_freelist(mp, pag);
+
+	/*
+	 * The AGFL size is the same for all AGs because all AGs have the same
+	 * layout. If this AG sameness ever changes in the future, we'll need to
+	 * revisit how we initialise the AGFLs.
+	 */
+	if (*freelist_size == 0)
+		*freelist_size = libxfs_alloc_min_freelist(mp, pag);
+	else if (*freelist_size < libxfs_alloc_min_freelist(mp, pag)) {
+		fprintf(stderr,
+_("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
+			progname, libxfs_alloc_min_freelist(mp, pag),
+			agno, *freelist_size);
+		exit(1);
+	}
+
 	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
 
 	/*
@@ -3724,14 +3738,14 @@ static void
 initialise_ag_freespace(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
-	int			worst_freelist)
+	int			freelist_size)
 {
 	struct xfs_alloc_arg	args;
 	struct xfs_trans	*tp;
 	struct xfs_trans_res tres = {0};
 	int			c;
 
-	c = libxfs_trans_alloc(mp, &tres, worst_freelist, 0, 0, &tp);
+	c = libxfs_trans_alloc(mp, &tres, freelist_size, 0, 0, &tp);
 	if (c)
 		res_failed(c);
 
@@ -3797,7 +3811,7 @@ main(
 	int			quiet = 0;
 	char			*protofile = NULL;
 	char			*protostring = NULL;
-	int			worst_freelist = 0;
+	int			freelist_size = 0;
 
 	struct libxfs_xinit	xi = {
 		.isdirect = LIBXFS_DIRECT,
@@ -4025,16 +4039,12 @@ main(
 	}
 
 	/*
-	 * Initialise all the static on disk metadata.
+	 * Initialise all the AG headers on disk.
 	 */
-	for (agno = 0; agno < cfg.agcount; agno++)
-		initialise_ag_headers(&cfg, mp, sbp, agno, &worst_freelist);
-
-	/*
-	 * Initialise the freespace freelists (i.e. AGFLs) in each AG.
-	 */
-	for (agno = 0; agno < cfg.agcount; agno++)
-		initialise_ag_freespace(mp, agno, worst_freelist);
+	for (agno = 0; agno < cfg.agcount; agno++) {
+		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size);
+		initialise_ag_freespace(mp, agno, freelist_size);
+	}
 
 	/*
 	 * Allocate the root inode and anything else in the proto file.
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/4] mkfs: introduce new delayed write buffer list
  2018-09-05  8:19 [RFCRAP PATCH 0/4 v2] mkfs.xfs IO scalability Dave Chinner
  2018-09-05  8:19 ` [PATCH 1/4] mkfs: stop zeroing old superblocks excessively Dave Chinner
  2018-09-05  8:19 ` [PATCH 2/4] mkfs: rework AG header initialisation ordering Dave Chinner
@ 2018-09-05  8:19 ` Dave Chinner
  2018-09-06 13:32   ` Brian Foster
  2018-09-05  8:19 ` [PATCH 4/4] mkfs: Use AIO for batched writeback Dave Chinner
  3 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2018-09-05  8:19 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Similar to the kernel concept of delayed write buffers, modify the
xfs_buf to have an internal list head we can use to park dirty
buffers we need to write back for later processing. This enables us
to control writeback directly, rather than have it occur as a side
effect of buffer cache LRU pressure.

Because the whole transaction subsystem is different in userspace,
we need to pass the delwri list to the commit code so that it can
add the buffers dirtied in the transaction to the delwri list rather
than writing them back immediately. This is really special case code
for mkfs because we don't have a proper metadata writeback setup
like we do in the kernel. It's a crutch to enable mkfs to do
async writeback, nothing more.

By itself, this change does not improve performance - IO
dispatch from mkfs is still synchronous, so it can't drive a queue
depth of more than 1. But we now have batched writeback....
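
Stripped of all the buffer cache and transaction glue, the
queue-sort-flush idea is nothing libxfs specific. A completely
made-up, standalone toy of the pattern (nothing here is real xfs_buf
code - just a scratch file and fixed-size regions for illustration)
looks like this - batch up the dirty regions, sort them by disk
address so writeback goes out in ascending offset order, then push
them all out in one pass:

/* Toy standalone queue-then-flush-sorted illustration; not libxfs code. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct pending_write {
	off_t	offset;			/* disk address of the dirty region */
	char	data[512];		/* contents to write back */
};

/* flush in ascending offset order so the IO stream is sequential */
static int
cmp_offset(const void *a, const void *b)
{
	const struct pending_write	*pa = a, *pb = b;

	if (pa->offset < pb->offset)
		return -1;
	return pa->offset > pb->offset;
}

int
main(void)
{
	struct pending_write	batch[64];
	int			fd, i, nr = 0;

	fd = open("/tmp/delwri-scratch", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		exit(1);

	/* "dirty" the regions in a deliberately scattered order */
	for (i = 0; i < 64; i++) {
		batch[nr].offset = (off_t)((i * 37) % 64) * 512;
		memset(batch[nr].data, i, sizeof(batch[nr].data));
		nr++;
	}

	/* the flush: sort by offset, then write everything back */
	qsort(batch, nr, sizeof(batch[0]), cmp_offset);
	for (i = 0; i < nr; i++) {
		if (pwrite(fd, batch[i].data, sizeof(batch[i].data),
			   batch[i].offset) != sizeof(batch[i].data)) {
			perror("pwrite");
			exit(1);
		}
	}
	close(fd);
	return 0;
}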

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/xfs_trans.h |  2 ++
 libxfs/libxfs_io.h  |  6 +++++
 libxfs/rdwr.c       | 65 +++++++++++++++++++++++++++++++++++++++++++++
 libxfs/trans.c      | 47 ++++++++++++++++++++++++--------
 mkfs/xfs_mkfs.c     | 37 +++++++++++++++-----------
 5 files changed, 131 insertions(+), 26 deletions(-)

diff --git a/include/xfs_trans.h b/include/xfs_trans.h
index 63972e4fff0f..25de8b7c757c 100644
--- a/include/xfs_trans.h
+++ b/include/xfs_trans.h
@@ -84,6 +84,8 @@ int	libxfs_trans_alloc(struct xfs_mount *mp, struct xfs_trans_res *resp,
 			   struct xfs_trans **tpp);
 int	libxfs_trans_alloc_empty(struct xfs_mount *mp, struct xfs_trans **tpp);
 int	libxfs_trans_commit(struct xfs_trans *);
+int	libxfs_trans_commit_delwri(struct xfs_trans *tp,
+				   struct list_head *delwri_list);
 void	libxfs_trans_cancel(struct xfs_trans *);
 struct xfs_buf *libxfs_trans_getsb(struct xfs_trans *, struct xfs_mount *, int);
 
diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
index 12064d798a2d..c69cc7cd7ec5 100644
--- a/libxfs/libxfs_io.h
+++ b/libxfs/libxfs_io.h
@@ -66,6 +66,7 @@ typedef struct xfs_buf {
 	struct xfs_buf_map	*b_maps;
 	struct xfs_buf_map	__b_map;
 	int			b_nmaps;
+	struct list_head	b_list;
 #ifdef XFS_BUF_TRACING
 	struct list_head	b_lock_list;
 	const char		*b_func;
@@ -81,6 +82,7 @@ enum xfs_buf_flags_t {	/* b_flags bits */
 	LIBXFS_B_UPTODATE	= 0x0008,	/* buffer is sync'd to disk */
 	LIBXFS_B_DISCONTIG	= 0x0010,	/* discontiguous buffer */
 	LIBXFS_B_UNCHECKED	= 0x0020,	/* needs verification */
+	LIBXFS_B_DELWRI_Q	= 0x0040,	/* buffer is on a delwri list */
 };
 
 #define XFS_BUF_DADDR_NULL		((xfs_daddr_t) (-1LL))
@@ -168,6 +170,10 @@ extern void	libxfs_putbuf (xfs_buf_t *);
 
 #endif
 
+extern void	libxfs_buf_delwri_add(struct xfs_buf *bp, int flags,
+			struct list_head *delwri_list);
+extern int	libxfs_buf_delwri_flush(struct list_head *delwri_list);
+
 extern void	libxfs_readbuf_verify(struct xfs_buf *bp,
 			const struct xfs_buf_ops *ops);
 extern xfs_buf_t *libxfs_getsb(struct xfs_mount *, int);
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index 14a4633e9fa6..7fbaae571abe 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -579,6 +579,7 @@ __initbuf(xfs_buf_t *bp, struct xfs_buftarg *btp, xfs_daddr_t bno,
 	bp->b_holder = 0;
 	bp->b_recur = 0;
 	bp->b_ops = NULL;
+	list_head_init(&bp->b_list);
 
 	if (!bp->b_maps) {
 		bp->b_nmaps = 1;
@@ -1196,6 +1197,70 @@ libxfs_writebuf(xfs_buf_t *bp, int flags)
 	return 0;
 }
 
+void
+libxfs_buf_delwri_add(
+	struct xfs_buf		*bp,
+	int			flags,
+	struct list_head	*delwri_list)
+{
+	if (bp->b_flags & LIBXFS_B_DELWRI_Q) {
+		libxfs_putbuf(bp);
+		return;
+	}
+
+	libxfs_writebuf_int(bp, flags);
+	bp->b_flags |= LIBXFS_B_DELWRI_Q;
+	list_add(&bp->b_list, delwri_list);
+}
+
+/*
+ * Compare function is more complex than it needs to be because
+ * the return value is only 32 bits and we are doing comparisons
+ * on 64 bit values
+ */
+static int
+xfs_buf_cmp(
+	void		*priv,
+	struct list_head *a,
+	struct list_head *b)
+{
+	struct xfs_buf	*ap = container_of(a, struct xfs_buf, b_list);
+	struct xfs_buf	*bp = container_of(b, struct xfs_buf, b_list);
+	xfs_daddr_t	diff;
+
+	diff = ap->b_maps[0].bm_bn - bp->b_maps[0].bm_bn;
+	if (diff < 0)
+		return -1;
+	if (diff > 0)
+		return 1;
+	return 0;
+}
+
+/* Processes entire list, but only returns the first error found */
+int
+libxfs_buf_delwri_flush(
+	struct list_head	*delwri_list)
+{
+	struct xfs_buf		*bp;
+	int			 error = 0;
+
+	list_sort(NULL, delwri_list, xfs_buf_cmp);
+	while (!list_empty(delwri_list)) {
+		bp = list_first_entry(delwri_list, struct xfs_buf, b_list);
+		list_del_init(&bp->b_list);
+		bp->b_flags &= ~LIBXFS_B_DELWRI_Q;
+		if (!bp->b_error && (bp->b_flags & LIBXFS_B_DIRTY)) {
+			int ret;
+			ret = libxfs_writebufr(bp);
+			if (ret && !error)
+				error = ret;
+		}
+		libxfs_putbuf(bp);
+	}
+	return error;
+}
+
+
 void
 libxfs_iomove(xfs_buf_t *bp, uint boff, int len, void *data, int flags)
 {
diff --git a/libxfs/trans.c b/libxfs/trans.c
index 2bb0d3b8e2d1..c3da46479efa 100644
--- a/libxfs/trans.c
+++ b/libxfs/trans.c
@@ -728,10 +728,11 @@ inode_item_done(
 
 static void
 buf_item_done(
-	xfs_buf_log_item_t	*bip)
+	struct xfs_buf_log_item	*bip,
+	struct list_head	*delwri_list)
 {
-	xfs_buf_t		*bp;
-	int			hold;
+	struct xfs_buf		*bp;
+	bool			hold;
 	extern kmem_zone_t	*xfs_buf_item_zone;
 
 	bp = bip->bli_buf;
@@ -745,7 +746,13 @@ buf_item_done(
 		fprintf(stderr, "flushing/staling buffer %p (hold=%d)\n",
 			bp, hold);
 #endif
-		libxfs_writebuf_int(bp, 0);
+		if (delwri_list) {
+			/* delwri list needs to hold on to the buffer here */
+			libxfs_buf_delwri_add(bp, 0, delwri_list);
+			hold = true;
+		} else {
+			libxfs_writebuf_int(bp, 0);
+		}
 	}
 	if (hold)
 		bip->bli_flags &= ~XFS_BLI_HOLD;
@@ -757,7 +764,8 @@ buf_item_done(
 
 static void
 trans_committed(
-	xfs_trans_t		*tp)
+	struct xfs_trans	*tp,
+	struct list_head	*delwri_list)
 {
 	struct xfs_log_item	*lip, *next;
 
@@ -765,7 +773,7 @@ trans_committed(
 		xfs_trans_del_item(lip);
 
 		if (lip->li_type == XFS_LI_BUF)
-			buf_item_done((xfs_buf_log_item_t *)lip);
+			buf_item_done((xfs_buf_log_item_t *)lip, delwri_list);
 		else if (lip->li_type == XFS_LI_INODE)
 			inode_item_done((xfs_inode_log_item_t *)lip);
 		else {
@@ -828,11 +836,12 @@ xfs_trans_free_items(
 /*
  * Commit the changes represented by this transaction
  */
-int
-libxfs_trans_commit(
-	xfs_trans_t	*tp)
+static int
+trans_commit(
+	struct xfs_trans	*tp,
+	struct list_head	*delwri_list)
 {
-	xfs_sb_t	*sbp;
+	struct xfs_sb		*sbp;
 
 	if (tp == NULL)
 		return 0;
@@ -862,9 +871,25 @@ libxfs_trans_commit(
 #ifdef XACT_DEBUG
 	fprintf(stderr, "committing dirty transaction %p\n", tp);
 #endif
-	trans_committed(tp);
+	trans_committed(tp, delwri_list);
 
 	/* That's it for the transaction structure.  Free it. */
 	xfs_trans_free(tp);
 	return 0;
 }
+
+int
+libxfs_trans_commit(
+	struct xfs_trans	*tp)
+{
+	return trans_commit(tp, NULL);
+}
+
+int
+libxfs_trans_commit_delwri(
+	struct xfs_trans	*tp,
+	struct list_head	*delwri_list)
+{
+	return trans_commit(tp, delwri_list);
+}
+
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index d70fbdb6b15a..b751b1fcb4a3 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -3374,7 +3374,8 @@ initialise_ag_headers(
 	struct xfs_mount	*mp,
 	struct xfs_sb		*sbp,
 	xfs_agnumber_t		agno,
-	int			*freelist_size)
+	int			*freelist_size,
+	struct list_head	*delwri_list)
 {
 	struct xfs_perag	*pag = libxfs_perag_get(mp, agno);
 	struct xfs_agfl		*agfl;
@@ -3402,7 +3403,7 @@ initialise_ag_headers(
 	buf->b_ops = &xfs_sb_buf_ops;
 	memset(buf->b_addr, 0, cfg->sectorsize);
 	libxfs_sb_to_disk(buf->b_addr, sbp);
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * AG header block: freespace
@@ -3469,7 +3470,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 		exit(1);
 	}
 
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * AG freelist header block
@@ -3489,7 +3490,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 			agfl->agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);
 	}
 
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * AG header block: inodes
@@ -3518,7 +3519,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 		platform_uuid_copy(&agi->agi_uuid, &sbp->sb_uuid);
 	for (c = 0; c < XFS_AGI_UNLINKED_BUCKETS; c++)
 		agi->agi_unlinked[c] = cpu_to_be32(NULLAGINO);
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * BNO btree root block
@@ -3570,7 +3571,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 	if (!arec->ar_blockcount)
 		block->bb_numrecs = 0;
 
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * CNT btree root block
@@ -3612,7 +3613,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 	if (!arec->ar_blockcount)
 		block->bb_numrecs = 0;
 
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * refcount btree root block
@@ -3626,7 +3627,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 		block = XFS_BUF_TO_BLOCK(buf);
 		memset(block, 0, cfg->blocksize);
 		libxfs_btree_init_block(mp, buf, XFS_BTNUM_REFC, 0, 0, agno, 0);
-		libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+		libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 	}
 
 	/*
@@ -3639,7 +3640,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 	block = XFS_BUF_TO_BLOCK(buf);
 	memset(block, 0, cfg->blocksize);
 	libxfs_btree_init_block(mp, buf, XFS_BTNUM_INO, 0, 0, agno, 0);
-	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 
 	/*
 	 * Free INO btree root block
@@ -3652,7 +3653,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 		block = XFS_BUF_TO_BLOCK(buf);
 		memset(block, 0, cfg->blocksize);
 		libxfs_btree_init_block(mp, buf, XFS_BTNUM_FINO, 0, 0, agno, 0);
-		libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+		libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 	}
 
 	/* RMAP btree root block */
@@ -3728,7 +3729,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
 			be16_add_cpu(&block->bb_numrecs, 1);
 		}
 
-		libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
+		libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
 	}
 
 	libxfs_perag_put(pag);
@@ -3738,7 +3739,8 @@ static void
 initialise_ag_freespace(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
-	int			freelist_size)
+	int			freelist_size,
+	struct list_head	*delwri_list)
 {
 	struct xfs_alloc_arg	args;
 	struct xfs_trans	*tp;
@@ -3758,7 +3760,7 @@ initialise_ag_freespace(
 
 	libxfs_alloc_fix_freelist(&args, 0);
 	libxfs_perag_put(args.pag);
-	libxfs_trans_commit(tp);
+	libxfs_trans_commit_delwri(tp, delwri_list);
 }
 
 /*
@@ -3812,6 +3814,7 @@ main(
 	char			*protofile = NULL;
 	char			*protostring = NULL;
 	int			freelist_size = 0;
+	LIST_HEAD		(delwri_list);
 
 	struct libxfs_xinit	xi = {
 		.isdirect = LIBXFS_DIRECT,
@@ -4042,9 +4045,13 @@ main(
 	 * Initialise all the AG headers on disk.
 	 */
 	for (agno = 0; agno < cfg.agcount; agno++) {
-		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size);
-		initialise_ag_freespace(mp, agno, freelist_size);
+		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size,
+				      &delwri_list);
+		initialise_ag_freespace(mp, agno, freelist_size, &delwri_list);
+		if (agno && !(agno % 100))
+			libxfs_buf_delwri_flush(&delwri_list);
 	}
+	libxfs_buf_delwri_flush(&delwri_list);
 
 	/*
 	 * Allocate the root inode and anything else in the proto file.
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/4] mkfs: Use AIO for batched writeback
  2018-09-05  8:19 [RFCRAP PATCH 0/4 v2] mkfs.xfs IO scalability Dave Chinner
                   ` (2 preceding siblings ...)
  2018-09-05  8:19 ` [PATCH 3/4] mkfs: introduce new delayed write buffer list Dave Chinner
@ 2018-09-05  8:19 ` Dave Chinner
  2018-09-06 13:32   ` Brian Foster
  3 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2018-09-05  8:19 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

What it says. 8EB filesystem can now be made in 15 minutes on
storage that can do 100,000 IOPS. Before this patch mkfs topped out
at about 13,000 IOPS.

Proof of concept proven.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 libxfs/libxfs_io.h |   1 +
 libxfs/rdwr.c      | 170 +++++++++++++++++++++++++++++++++++++++++++--
 mkfs/Makefile      |   2 +-
 mkfs/xfs_mkfs.c    |   2 +-
 4 files changed, 169 insertions(+), 6 deletions(-)

diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
index c69cc7cd7ec5..2b11d008998d 100644
--- a/libxfs/libxfs_io.h
+++ b/libxfs/libxfs_io.h
@@ -67,6 +67,7 @@ typedef struct xfs_buf {
 	struct xfs_buf_map	__b_map;
 	int			b_nmaps;
 	struct list_head	b_list;
+	int			b_io_count;
 #ifdef XFS_BUF_TRACING
 	struct list_head	b_lock_list;
 	const char		*b_func;
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index 7fbaae571abe..5336f6a1e1de 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -20,6 +20,9 @@
 
 #include "libxfs.h"		/* for LIBXFS_EXIT_ON_FAILURE */
 
+/* XXX: all the aio stuff is hacky and proof of concept only */
+#include <libaio.h>
+
 /*
  * Important design/architecture note:
  *
@@ -1236,6 +1239,151 @@ xfs_buf_cmp(
 	return 0;
 }
 
+/* hacky AIO stuff that'll handle a single delwri queue flush at a time */
+#define MAX_AIO_EVENTS 1024
+io_context_t ctxp;
+struct iocb *iocbs[MAX_AIO_EVENTS];
+struct io_event ioevents[MAX_AIO_EVENTS];
+int aio_next;
+int aio_flight;
+
+static void
+init_aio_delwri(void)
+{
+	int i, r;
+
+	memset(&ctxp, 0, sizeof(ctxp));
+	r = io_setup(MAX_AIO_EVENTS, &ctxp);
+	if (r) {
+		printf("FAIL! io_setup returned %d\n", r);
+		exit(1);
+	}
+	for (i = 0; i < MAX_AIO_EVENTS; ++i) {
+		iocbs[i] = calloc(1, sizeof(struct iocb));
+		if (iocbs[i] == NULL) {
+			printf("failed to allocate an iocb\n");
+			exit(1);
+		}
+	}
+}
+
+int
+get_write_completions(
+	int	threshold)
+{
+	int	error = 0;
+	int	i, r;
+
+	while (aio_flight > threshold) {
+		/* gather up some completions */
+		r = io_getevents(ctxp, 1, MAX_AIO_EVENTS, ioevents, NULL);
+		if (r < 0)  {
+			printf("FAIL! io_getevents returned %d\n", r);
+			exit(1);
+		}
+
+		aio_flight -= r;
+		for (i = 0; i < r; ++i) {
+			struct xfs_buf *bp = ioevents[i].data;
+			if (ioevents[i].res < 0)
+				bp->b_error = ioevents[i].res;
+			else if (ioevents[i].res != bp->b_bcount)
+				bp->b_error = -EIO;
+
+			if (--bp->b_io_count == 0 && !bp->b_error) {
+				bp->b_flags |= LIBXFS_B_UPTODATE;
+				bp->b_flags &= ~(LIBXFS_B_DIRTY | LIBXFS_B_EXIT |
+						 LIBXFS_B_UNCHECKED);
+				libxfs_putbuf(bp);
+			} else if (!error) {
+				error = bp->b_error;
+			}
+		}
+		/* wait for a short while */
+		usleep(1000);
+	}
+	return error;
+}
+
+static int
+__aio_write(
+	struct xfs_buf  *bp,
+	void *buf,
+	int len,
+	off64_t offset)
+{
+	int	r, i;
+	int	fd = libxfs_device_to_fd(bp->b_target->dev);
+
+	i = aio_next++ % MAX_AIO_EVENTS;
+	aio_flight++;
+	bp->b_io_count++;
+
+	io_prep_pwrite(iocbs[i], fd, buf, len, offset);
+	iocbs[i]->data = bp;
+	r = io_submit(ctxp, 1, &iocbs[i]);
+	if (r != 1) {
+		aio_flight--;
+		if (bp->b_flags & LIBXFS_B_EXIT)
+			exit(1);
+		return -EIO;
+	}
+	return 0;
+
+}
+
+static int
+do_write(
+	struct xfs_buf	*bp)
+{
+	/*
+	 * we never write buffers that are marked stale. This indicates they
+	 * contain data that has been invalidated, and even if the buffer is
+	 * dirty it must *never* be written. Verifiers are wonderful for finding
+	 * bugs like this. Make sure the error is obvious as to the cause.
+	 */
+	if (bp->b_flags & LIBXFS_B_STALE) {
+		bp->b_error = -ESTALE;
+		return bp->b_error;
+	}
+
+	/*
+	 * clear any pre-existing error status on the buffer. This can occur if
+	 * the buffer is corrupt on disk and the repair process doesn't clear
+	 * the error before fixing and writing it back.
+	 */
+	bp->b_error = 0;
+	if (bp->b_ops) {
+		bp->b_ops->verify_write(bp);
+		if (bp->b_error) {
+			fprintf(stderr,
+	_("%s: write verifer failed on %s bno 0x%llx/0x%x\n"),
+				__func__, bp->b_ops->name,
+				(long long)bp->b_bn, bp->b_bcount);
+			return bp->b_error;
+		}
+	}
+
+	if (!(bp->b_flags & LIBXFS_B_DISCONTIG)) {
+		bp->b_error = __aio_write(bp, bp->b_addr, bp->b_bcount,
+				    LIBXFS_BBTOOFF64(bp->b_bn));
+	} else {
+		int	i;
+		void	*buf = bp->b_addr;
+
+		for (i = 0; i < bp->b_nmaps; i++) {
+			off64_t	offset = LIBXFS_BBTOOFF64(bp->b_maps[i].bm_bn);
+			int len = BBTOB(bp->b_maps[i].bm_len);
+
+			bp->b_error = __aio_write(bp, buf, len, offset);
+			if (bp->b_error)
+				break;	/* XXX: completion wait required! */
+			buf += len;
+		}
+	}
+	return bp->b_error;
+}
+
 /* Processes entire list, but only returns the first error found */
 int
 libxfs_buf_delwri_flush(
@@ -1243,21 +1391,35 @@ libxfs_buf_delwri_flush(
 {
 	struct xfs_buf		*bp;
 	int			 error = 0;
+	static bool		aio_init = false;
+	int			ret;
+
+	if (!aio_init) {
+		init_aio_delwri();
+		aio_init = true;
+	}
 
 	list_sort(NULL, delwri_list, xfs_buf_cmp);
 	while (!list_empty(delwri_list)) {
 		bp = list_first_entry(delwri_list, struct xfs_buf, b_list);
 		list_del_init(&bp->b_list);
 		bp->b_flags &= ~LIBXFS_B_DELWRI_Q;
+		bp->b_io_count = 1;
 		if (!bp->b_error && (bp->b_flags & LIBXFS_B_DIRTY)) {
-			int ret;
-			ret = libxfs_writebufr(bp);
+			ret = do_write(bp);
 			if (ret && !error)
 				error = ret;
 		}
-		libxfs_putbuf(bp);
+		if (--bp->b_io_count == 0)
+			libxfs_putbuf(bp);
+
+		/* wait for some completions if we've dispatched lots */
+		ret = get_write_completions(MAX_AIO_EVENTS / 5 * 4);
+		if (ret && !error)
+			error = ret;
 	}
-	return error;
+	ret = get_write_completions(0);
+	return error ? error : ret;
 }
 
 
diff --git a/mkfs/Makefile b/mkfs/Makefile
index 31482b08d559..e27f10f26b14 100644
--- a/mkfs/Makefile
+++ b/mkfs/Makefile
@@ -11,7 +11,7 @@ HFILES =
 CFILES = proto.c xfs_mkfs.c
 
 LLDLIBS += $(LIBXFS) $(LIBXCMD) $(LIBFROG) $(LIBRT) $(LIBPTHREAD) $(LIBBLKID) \
-	$(LIBUUID)
+	$(LIBUUID) -laio
 LTDEPENDENCIES += $(LIBXFS) $(LIBXCMD) $(LIBFROG)
 LLDFLAGS = -static-libtool-libs
 
diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index b751b1fcb4a3..82e041dd1b48 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -4048,7 +4048,7 @@ main(
 		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size,
 				      &delwri_list);
 		initialise_ag_freespace(mp, agno, freelist_size, &delwri_list);
-		if (agno && !(agno % 100))
+		if (agno && !(agno % 1000))
 			libxfs_buf_delwri_flush(&delwri_list);
 	}
 	libxfs_buf_delwri_flush(&delwri_list);
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/4] mkfs: stop zeroing old superblocks excessively
  2018-09-05  8:19 ` [PATCH 1/4] mkfs: stop zeroing old superblocks excessively Dave Chinner
@ 2018-09-06 13:31   ` Brian Foster
  2018-09-07  0:04     ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2018-09-06 13:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 05, 2018 at 06:19:29PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When making a new filesystem, don't zero superblocks way beyond the
> end of the new filesystem. If the old filesystem was an EB scale
> filesystem, then this zeroing requires millions of IOs to complete.
> We don't want to do this if the new filesystem on the device is only
> going to be 100TB. Sure, zeroing old superblocks a good distance
> beyond the new size is a good idea, as is zeroing the ones in the
> middle and end, but the other 7,999,000 superblocks? Not so much.
> 
> Make a sane cut-off decision - zero out to 10x the size of the new
> filesystem, then zero the middle AGs in the old filesystem, then
> zero the last ones.
> 
> The initial zeroing out to 10x the new fs size means that this code
> will only ever trigger in rare corner cases outside a testing
> environment - there are very few production workloads where a huge
> block device is reused immediately and permanently for a much
> smaller filesystem. Those that do this (e.g. on thin provisioned
> devices) discard the in-use blocks anyway and so the zeroing won't
> actually do anything useful.
> 
> Time to mkfs a 1TB filesystem on a big device after it held another
> larger filesystem:
> 
> previous FS size	10PB	100PB	 1EB
> old mkfs time		1.95s	8.9s	81.3s
> patched			0.95s	1.2s	 1.2s
> 

Certainly a nice speedup...

> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  mkfs/xfs_mkfs.c | 64 +++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 59 insertions(+), 5 deletions(-)
> 
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index 2e53c1e83b6a..c153592c705e 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -1155,14 +1155,15 @@ validate_ag_geometry(
>  
>  static void
>  zero_old_xfs_structures(
> -	libxfs_init_t		*xi,
> -	xfs_sb_t		*new_sb)
> +	struct libxfs_xinit	*xi,
> +	struct xfs_sb		*new_sb)
>  {
> -	void 			*buf;
> -	xfs_sb_t 		sb;
> +	void			*buf;
> +	struct xfs_sb		sb;
>  	uint32_t		bsize;
>  	int			i;
>  	xfs_off_t		off;
> +	xfs_off_t		end;
>  
>  	/*
>  	 * We open regular files with O_TRUNC|O_CREAT. Nothing to do here...
> @@ -1220,15 +1221,68 @@ zero_old_xfs_structures(
>  
>  	/*
>  	 * block size and basic geometry seems alright, zero the secondaries.
> +	 *
> +	 * Don't be insane when it comes to overwriting really large filesystems
> +	 * as it could take millions of IOs to zero every secondary
> +	 * superblock. If we are remaking a huge filesystem, then do the
> +	 * zeroing, but if we are replacing it with a small one (typically done
> +	 * in test environments, limit the zeroing to:
> +	 *
> +	 *	- around the range of the new filesystem
> +	 *	- the middle of the old filesystem
> +	 *	- the end of the old filesystem
> +	 *
> +	 * Killing the middle and end of the old filesystem will prevent repair
> +	 * from finding it with it's fast secondary sb scan algorithm. The slow
> +	 * scan algorithm will then confirm the small filesystem geometry by
> +	 * brute force scans.
>  	 */
>  	memset(buf, 0, new_sb->sb_sectsize);
> +
> +	/* this carefully avoids integer overflows */
> +	end = sb.sb_dblocks;
> +	if (sb.sb_agcount > 10000 &&
> +	    new_sb->sb_dblocks < end / 10)
> +		end = new_sb->sb_dblocks * 10;

... but what's with the 10k agcount cutoff? Just a number out of a hat
to demonstrate the improvement..?

Given that you've done the work to rough in an AIO buffered write
mechanism for mkfs, have you considered whether we can find a way to
apply that mechanism here? I'm guessing that the result of using AIO
wouldn't be quite as impressive as not doing I/O at all, but as you've
already noted, this algorithmic optimization is more targeted at test
environments than production ones. The AIO approach sounds like it could
be more broadly beneficial, even if not quite as fast in those
particular test cases.

>  	off = 0;
> -	for (i = 1; i < sb.sb_agcount; i++)  {
> +	for (i = 1; i < sb.sb_agcount && off < end; i++)  {
> +		off += sb.sb_agblocks;
> +		if (pwrite(xi->dfd, buf, new_sb->sb_sectsize,
> +					off << sb.sb_blocklog) == -1)
> +			break;
> +	}
> +
> +	if (end == sb.sb_dblocks)
> +		return;
> +
> +	/*
> +	 * Trash the middle 1000 AGs of the old fs, which we know has at least
> +	 * 10000 AGs at this point. Cast to make sure we are doing 64bit
> +	 * multiplies, otherwise off gets truncated to 32 bit. I hate C.
> +	 */
> +	i = (sb.sb_agcount / 2) - 500;
> +	off = (xfs_off_t)sb.sb_agblocks * i;
> +	off = (xfs_off_t)sb.sb_agblocks * ((sb.sb_agcount / 2) - 500);

Looks like a couple lines of dead code there.

Brian

> +	end = off + 1000 * sb.sb_agblocks;
> +	while (off < end) {
> +		if (pwrite(xi->dfd, buf, new_sb->sb_sectsize,
> +					off << sb.sb_blocklog) == -1)
> +			break;
>  		off += sb.sb_agblocks;
> +	}
> +
> +	/*
> +	 * Trash the last 1000 AGs of the old fs
> +	 */
> +	off = (xfs_off_t)sb.sb_agblocks * (sb.sb_agcount - 1000);
> +	end = sb.sb_dblocks;
> +	while (off < end) {
>  		if (pwrite(xi->dfd, buf, new_sb->sb_sectsize,
>  					off << sb.sb_blocklog) == -1)
>  			break;
> +		off += sb.sb_agblocks;
>  	}
> +
>  done:
>  	free(buf);
>  }
> -- 
> 2.17.0
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/4] mkfs: rework AG header initialisation ordering
  2018-09-05  8:19 ` [PATCH 2/4] mkfs: rework AG header initialisation ordering Dave Chinner
@ 2018-09-06 13:31   ` Brian Foster
  2018-09-07  0:08     ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2018-09-06 13:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 05, 2018 at 06:19:30PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When observing the behaviour of an 8EB mkfs execution, I noticed
> a phase where a massive number of read/modify/write cycles were
> occurring. I didn't wait for it to complete - it was obvious
> that it was after all the AG headers had been written. That left the
> AGFL initialisation as the likely cause.
> 
> When all the AG headers don't fit in the libxfs buffer cache, the
> AGFL init requires re-reading the AGF, the AGFL, the free space tree
> root blocks and the rmap tree root block. They all then get
> modified and written back out. 10 IOs per AG. When you have 8
> million AGs, that's a lot of extra IO.
> 
> Change the initialisation algorithm to initialise the AGFL
> immediately after initialising the rest of the headers and
> calculating the minimum AGFL size for that AG. This means the
> modifications will all hit the buffer cache and this will remove the
> IO penalty.
> 
> The "worst_freelist" size calculation doesn't change from AG to AG -
> it's based on the physical configuration of the AG, and all AGs have
> the same configuration. Hence we only need to calculate this once,
> not for every AG. That allows us to initialise the AGFL immediately
> after the rest of the AG has been initialised rather than in a
> separate pass.
> 
> Time to make a filesystem from scratch, using a zeroed device so the
> force overwrite algorithms are not triggered and -K to avoid
> discards:
> 
> FS size		10PB	100PB	 1EB
> current mkfs	26.9s	214.8s	2484s
> patched		11.3s	 70.3s	 709s
> 
> In both cases, the IO profile looks identical for the initial AG
> header writeout loop. The difference is that the old code then
> does the RMW loop to init the AGFL, and that runs at about half the
> speed. Hence runtime of the new code is reduced by around 65-70%
> simply by avoiding all that IO.
> 
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

This one seems like it stands alone as a nice fixup. Were you planning
to send this independently as a non-rfc patch?

A couple nits...

>  mkfs/xfs_mkfs.c | 40 +++++++++++++++++++++++++---------------
>  1 file changed, 25 insertions(+), 15 deletions(-)
> 
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index c153592c705e..d70fbdb6b15a 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -3374,7 +3374,7 @@ initialise_ag_headers(
>  	struct xfs_mount	*mp,
>  	struct xfs_sb		*sbp,
>  	xfs_agnumber_t		agno,
> -	int			*worst_freelist)
> +	int			*freelist_size)
>  {
>  	struct xfs_perag	*pag = libxfs_perag_get(mp, agno);
>  	struct xfs_agfl		*agfl;
> @@ -3453,8 +3453,22 @@ initialise_ag_headers(
>  		agf->agf_longest = cpu_to_be32(agsize -
>  			XFS_FSB_TO_AGBNO(mp, cfg->logstart) - cfg->logblocks);
>  	}
> -	if (libxfs_alloc_min_freelist(mp, pag) > *worst_freelist)
> -		*worst_freelist = libxfs_alloc_min_freelist(mp, pag);
> +
> +	/*
> +	 * The AGFL size is the same for all AGs because all AGs have the same
> +	 * layout. If this AG sameness ever changes in the future, we'll need to
> +	 * revisit how we initialise the AGFLs.
> +	 */

This is not necessarily the case if the last AG is not full size, right?
I think the comment could point that out (and/or that this works so long
as we don't process the last AG first).

BTW, libxfs_alloc_min_freelist() uses the ->pagf_levels values for the
bno, cnt and rmap btrees to establish the freelist size, and I don't see
where we've assigned ->pagf_levels[XFS_BTNUM_RMAPi] anywhere.

Brian

> +	if (*freelist_size == 0)
> +		*freelist_size = libxfs_alloc_min_freelist(mp, pag);
> +	else if (*freelist_size < libxfs_alloc_min_freelist(mp, pag)) {
> +		fprintf(stderr,
> +_("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
> +			progname, libxfs_alloc_min_freelist(mp, pag),
> +			agno, *freelist_size);
> +		exit(1);
> +	}
> +
>  	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
>  
>  	/*
> @@ -3724,14 +3738,14 @@ static void
>  initialise_ag_freespace(
>  	struct xfs_mount	*mp,
>  	xfs_agnumber_t		agno,
> -	int			worst_freelist)
> +	int			freelist_size)
>  {
>  	struct xfs_alloc_arg	args;
>  	struct xfs_trans	*tp;
>  	struct xfs_trans_res tres = {0};
>  	int			c;
>  
> -	c = libxfs_trans_alloc(mp, &tres, worst_freelist, 0, 0, &tp);
> +	c = libxfs_trans_alloc(mp, &tres, freelist_size, 0, 0, &tp);
>  	if (c)
>  		res_failed(c);
>  
> @@ -3797,7 +3811,7 @@ main(
>  	int			quiet = 0;
>  	char			*protofile = NULL;
>  	char			*protostring = NULL;
> -	int			worst_freelist = 0;
> +	int			freelist_size = 0;
>  
>  	struct libxfs_xinit	xi = {
>  		.isdirect = LIBXFS_DIRECT,
> @@ -4025,16 +4039,12 @@ main(
>  	}
>  
>  	/*
> -	 * Initialise all the static on disk metadata.
> +	 * Initialise all the AG headers on disk.
>  	 */
> -	for (agno = 0; agno < cfg.agcount; agno++)
> -		initialise_ag_headers(&cfg, mp, sbp, agno, &worst_freelist);
> -
> -	/*
> -	 * Initialise the freespace freelists (i.e. AGFLs) in each AG.
> -	 */
> -	for (agno = 0; agno < cfg.agcount; agno++)
> -		initialise_ag_freespace(mp, agno, worst_freelist);
> +	for (agno = 0; agno < cfg.agcount; agno++) {
> +		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size);
> +		initialise_ag_freespace(mp, agno, freelist_size);
> +	}
>  
>  	/*
>  	 * Allocate the root inode and anything else in the proto file.
> -- 
> 2.17.0
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/4] mkfs: introduce new delayed write buffer list
  2018-09-05  8:19 ` [PATCH 3/4] mkfs: introduce new delayed write buffer list Dave Chinner
@ 2018-09-06 13:32   ` Brian Foster
  2018-09-07  0:21     ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2018-09-06 13:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 05, 2018 at 06:19:31PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Similar to the kernel concept of delayed write buffers, modify the
> xfs_buf to have an internal list head we can use to park dirty
> buffers we need to write back for later processing. This enables us
> to control writeback directly, rather than have it occur as a side
> effect of buffer cache LRU pressure.
> 
> Because the whole transaction subsystem is different in userspace,
> we need to pass the delwri list to the commit code so that it can
> add the buffers dirtied in the transaction to the delwri list rather
> than writing them back immediately. This is really special case code
> for mkfs because we don't have a proper metadata writeback setup
> like we do in the kernel. It's a crutch to enable mkfs to do
> async writeback, nothing more.
> 
> By itself, this change does not improve performance - IO
> dispatch from mkfs is still synchronous, so it can't drive a queue
> depth of more than 1. But we now have batched writeback....
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/xfs_trans.h |  2 ++
>  libxfs/libxfs_io.h  |  6 +++++
>  libxfs/rdwr.c       | 65 +++++++++++++++++++++++++++++++++++++++++++++
>  libxfs/trans.c      | 47 ++++++++++++++++++++++++--------
>  mkfs/xfs_mkfs.c     | 37 +++++++++++++++-----------
>  5 files changed, 131 insertions(+), 26 deletions(-)
> 
> diff --git a/include/xfs_trans.h b/include/xfs_trans.h
> index 63972e4fff0f..25de8b7c757c 100644
> --- a/include/xfs_trans.h
> +++ b/include/xfs_trans.h
> @@ -84,6 +84,8 @@ int	libxfs_trans_alloc(struct xfs_mount *mp, struct xfs_trans_res *resp,
>  			   struct xfs_trans **tpp);
>  int	libxfs_trans_alloc_empty(struct xfs_mount *mp, struct xfs_trans **tpp);
>  int	libxfs_trans_commit(struct xfs_trans *);
> +int	libxfs_trans_commit_delwri(struct xfs_trans *tp,
> +				   struct list_head *delwri_list);
>  void	libxfs_trans_cancel(struct xfs_trans *);
>  struct xfs_buf *libxfs_trans_getsb(struct xfs_trans *, struct xfs_mount *, int);
>  
> diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
> index 12064d798a2d..c69cc7cd7ec5 100644
> --- a/libxfs/libxfs_io.h
> +++ b/libxfs/libxfs_io.h
> @@ -66,6 +66,7 @@ typedef struct xfs_buf {
>  	struct xfs_buf_map	*b_maps;
>  	struct xfs_buf_map	__b_map;
>  	int			b_nmaps;
> +	struct list_head	b_list;
>  #ifdef XFS_BUF_TRACING
>  	struct list_head	b_lock_list;
>  	const char		*b_func;
> @@ -81,6 +82,7 @@ enum xfs_buf_flags_t {	/* b_flags bits */
>  	LIBXFS_B_UPTODATE	= 0x0008,	/* buffer is sync'd to disk */
>  	LIBXFS_B_DISCONTIG	= 0x0010,	/* discontiguous buffer */
>  	LIBXFS_B_UNCHECKED	= 0x0020,	/* needs verification */
> +	LIBXFS_B_DELWRI_Q	= 0x0040,	/* buffer is on a delwri list */
>  };
>  
>  #define XFS_BUF_DADDR_NULL		((xfs_daddr_t) (-1LL))
> @@ -168,6 +170,10 @@ extern void	libxfs_putbuf (xfs_buf_t *);
>  
>  #endif
>  
> +extern void	libxfs_buf_delwri_add(struct xfs_buf *bp, int flags,
> +			struct list_head *delwri_list);
> +extern int	libxfs_buf_delwri_flush(struct list_head *delwri_list);
> +
>  extern void	libxfs_readbuf_verify(struct xfs_buf *bp,
>  			const struct xfs_buf_ops *ops);
>  extern xfs_buf_t *libxfs_getsb(struct xfs_mount *, int);
> diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
> index 14a4633e9fa6..7fbaae571abe 100644
> --- a/libxfs/rdwr.c
> +++ b/libxfs/rdwr.c
> @@ -579,6 +579,7 @@ __initbuf(xfs_buf_t *bp, struct xfs_buftarg *btp, xfs_daddr_t bno,
>  	bp->b_holder = 0;
>  	bp->b_recur = 0;
>  	bp->b_ops = NULL;
> +	list_head_init(&bp->b_list);
>  
>  	if (!bp->b_maps) {
>  		bp->b_nmaps = 1;
> @@ -1196,6 +1197,70 @@ libxfs_writebuf(xfs_buf_t *bp, int flags)
>  	return 0;
>  }
>  
> +void
> +libxfs_buf_delwri_add(
> +	struct xfs_buf		*bp,
> +	int			flags,
> +	struct list_head	*delwri_list)
> +{
> +	if (bp->b_flags & LIBXFS_B_DELWRI_Q) {
> +		libxfs_putbuf(bp);
> +		return;
> +	}
> +
> +	libxfs_writebuf_int(bp, flags);
> +	bp->b_flags |= LIBXFS_B_DELWRI_Q;
> +	list_add(&bp->b_list, delwri_list);
> +}
> +
> +/*
> + * Compare function is more complex than it needs to be because
> + * the return value is only 32 bits and we are doing comparisons
> + * on 64 bit values
> + */
> +static int
> +xfs_buf_cmp(
> +	void		*priv,
> +	struct list_head *a,
> +	struct list_head *b)
> +{
> +	struct xfs_buf	*ap = container_of(a, struct xfs_buf, b_list);
> +	struct xfs_buf	*bp = container_of(b, struct xfs_buf, b_list);
> +	xfs_daddr_t	diff;
> +
> +	diff = ap->b_maps[0].bm_bn - bp->b_maps[0].bm_bn;
> +	if (diff < 0)
> +		return -1;
> +	if (diff > 0)
> +		return 1;
> +	return 0;
> +}
> +
> +/* Processes entire list, but only returns the first error found */
> +int
> +libxfs_buf_delwri_flush(
> +	struct list_head	*delwri_list)
> +{
> +	struct xfs_buf		*bp;
> +	int			 error = 0;
> +
> +	list_sort(NULL, delwri_list, xfs_buf_cmp);
> +	while (!list_empty(delwri_list)) {
> +		bp = list_first_entry(delwri_list, struct xfs_buf, b_list);
> +		list_del_init(&bp->b_list);
> +		bp->b_flags &= ~LIBXFS_B_DELWRI_Q;
> +		if (!bp->b_error && (bp->b_flags & LIBXFS_B_DIRTY)) {
> +			int ret;
> +			ret = libxfs_writebufr(bp);
> +			if (ret && !error)
> +				error = ret;
> +		}
> +		libxfs_putbuf(bp);
> +	}
> +	return error;
> +}
> +
> +
>  void
>  libxfs_iomove(xfs_buf_t *bp, uint boff, int len, void *data, int flags)
>  {
> diff --git a/libxfs/trans.c b/libxfs/trans.c
> index 2bb0d3b8e2d1..c3da46479efa 100644
> --- a/libxfs/trans.c
> +++ b/libxfs/trans.c
> @@ -728,10 +728,11 @@ inode_item_done(
>  
>  static void
>  buf_item_done(
> -	xfs_buf_log_item_t	*bip)
> +	struct xfs_buf_log_item	*bip,
> +	struct list_head	*delwri_list)
>  {
> -	xfs_buf_t		*bp;
> -	int			hold;
> +	struct xfs_buf		*bp;
> +	bool			hold;
>  	extern kmem_zone_t	*xfs_buf_item_zone;
>  
>  	bp = bip->bli_buf;
> @@ -745,7 +746,13 @@ buf_item_done(
>  		fprintf(stderr, "flushing/staling buffer %p (hold=%d)\n",
>  			bp, hold);
>  #endif
> -		libxfs_writebuf_int(bp, 0);
> +		if (delwri_list) {
> +			/* delwri list needs to hold on to the buffer here */
> +			libxfs_buf_delwri_add(bp, 0, delwri_list);
> +			hold = true;

This seems a bit flakey. IIUC, the hold is set here because the delwri
queue either needs the reference until after I/O completion (or it
dropped the caller's reference already if the buffer was already present
on the queue). If BLI_HOLD is set in this case, however, haven't we
basically stolen the caller's reference?

I'm guessing this probably doesn't ever happen in the limited scope of
mkfs, so consider that an interface design nit for now. I suppose a more
robust mechanism might more closely resemble the kernel approach where
the delwri_queue() acquires its own reference on the buf (somehow or
another as applied to the xfsprogs buffer management system, I don't
have it all paged in atm).

Brian

> +		} else {
> +			libxfs_writebuf_int(bp, 0);
> +		}
>  	}
>  	if (hold)
>  		bip->bli_flags &= ~XFS_BLI_HOLD;
> @@ -757,7 +764,8 @@ buf_item_done(
>  
>  static void
>  trans_committed(
> -	xfs_trans_t		*tp)
> +	struct xfs_trans	*tp,
> +	struct list_head	*delwri_list)
>  {
>  	struct xfs_log_item	*lip, *next;
>  
> @@ -765,7 +773,7 @@ trans_committed(
>  		xfs_trans_del_item(lip);
>  
>  		if (lip->li_type == XFS_LI_BUF)
> -			buf_item_done((xfs_buf_log_item_t *)lip);
> +			buf_item_done((xfs_buf_log_item_t *)lip, delwri_list);
>  		else if (lip->li_type == XFS_LI_INODE)
>  			inode_item_done((xfs_inode_log_item_t *)lip);
>  		else {
> @@ -828,11 +836,12 @@ xfs_trans_free_items(
>  /*
>   * Commit the changes represented by this transaction
>   */
> -int
> -libxfs_trans_commit(
> -	xfs_trans_t	*tp)
> +static int
> +trans_commit(
> +	struct xfs_trans	*tp,
> +	struct list_head	*delwri_list)
>  {
> -	xfs_sb_t	*sbp;
> +	struct xfs_sb		*sbp;
>  
>  	if (tp == NULL)
>  		return 0;
> @@ -862,9 +871,25 @@ libxfs_trans_commit(
>  #ifdef XACT_DEBUG
>  	fprintf(stderr, "committing dirty transaction %p\n", tp);
>  #endif
> -	trans_committed(tp);
> +	trans_committed(tp, delwri_list);
>  
>  	/* That's it for the transaction structure.  Free it. */
>  	xfs_trans_free(tp);
>  	return 0;
>  }
> +
> +int
> +libxfs_trans_commit(
> +	struct xfs_trans	*tp)
> +{
> +	return trans_commit(tp, NULL);
> +}
> +
> +int
> +libxfs_trans_commit_delwri(
> +	struct xfs_trans	*tp,
> +	struct list_head	*delwri_list)
> +{
> +	return trans_commit(tp, delwri_list);
> +}
> +
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index d70fbdb6b15a..b751b1fcb4a3 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -3374,7 +3374,8 @@ initialise_ag_headers(
>  	struct xfs_mount	*mp,
>  	struct xfs_sb		*sbp,
>  	xfs_agnumber_t		agno,
> -	int			*freelist_size)
> +	int			*freelist_size,
> +	struct list_head	*delwri_list)
>  {
>  	struct xfs_perag	*pag = libxfs_perag_get(mp, agno);
>  	struct xfs_agfl		*agfl;
> @@ -3402,7 +3403,7 @@ initialise_ag_headers(
>  	buf->b_ops = &xfs_sb_buf_ops;
>  	memset(buf->b_addr, 0, cfg->sectorsize);
>  	libxfs_sb_to_disk(buf->b_addr, sbp);
> -	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
> +	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
>  
>  	/*
>  	 * AG header block: freespace
> @@ -3469,7 +3470,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
>  		exit(1);
>  	}
>  
> -	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
> +	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
>  
>  	/*
>  	 * AG freelist header block
> @@ -3489,7 +3490,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
>  			agfl->agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);
>  	}
>  
> -	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
> +	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
>  
>  	/*
>  	 * AG header block: inodes
> @@ -3518,7 +3519,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
>  		platform_uuid_copy(&agi->agi_uuid, &sbp->sb_uuid);
>  	for (c = 0; c < XFS_AGI_UNLINKED_BUCKETS; c++)
>  		agi->agi_unlinked[c] = cpu_to_be32(NULLAGINO);
> -	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
> +	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
>  
>  	/*
>  	 * BNO btree root block
> @@ -3570,7 +3571,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
>  	if (!arec->ar_blockcount)
>  		block->bb_numrecs = 0;
>  
> -	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
> +	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
>  
>  	/*
>  	 * CNT btree root block
> @@ -3612,7 +3613,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
>  	if (!arec->ar_blockcount)
>  		block->bb_numrecs = 0;
>  
> -	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
> +	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
>  
>  	/*
>  	 * refcount btree root block
> @@ -3626,7 +3627,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
>  		block = XFS_BUF_TO_BLOCK(buf);
>  		memset(block, 0, cfg->blocksize);
>  		libxfs_btree_init_block(mp, buf, XFS_BTNUM_REFC, 0, 0, agno, 0);
> -		libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
> +		libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
>  	}
>  
>  	/*
> @@ -3639,7 +3640,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
>  	block = XFS_BUF_TO_BLOCK(buf);
>  	memset(block, 0, cfg->blocksize);
>  	libxfs_btree_init_block(mp, buf, XFS_BTNUM_INO, 0, 0, agno, 0);
> -	libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
> +	libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
>  
>  	/*
>  	 * Free INO btree root block
> @@ -3652,7 +3653,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
>  		block = XFS_BUF_TO_BLOCK(buf);
>  		memset(block, 0, cfg->blocksize);
>  		libxfs_btree_init_block(mp, buf, XFS_BTNUM_FINO, 0, 0, agno, 0);
> -		libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
> +		libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
>  	}
>  
>  	/* RMAP btree root block */
> @@ -3728,7 +3729,7 @@ _("%s: Abort! Freelist size (%u) for AG %u not constant (%u)!\n"),
>  			be16_add_cpu(&block->bb_numrecs, 1);
>  		}
>  
> -		libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
> +		libxfs_buf_delwri_add(buf, LIBXFS_EXIT_ON_FAILURE, delwri_list);
>  	}
>  
>  	libxfs_perag_put(pag);
> @@ -3738,7 +3739,8 @@ static void
>  initialise_ag_freespace(
>  	struct xfs_mount	*mp,
>  	xfs_agnumber_t		agno,
> -	int			freelist_size)
> +	int			freelist_size,
> +	struct list_head	*delwri_list)
>  {
>  	struct xfs_alloc_arg	args;
>  	struct xfs_trans	*tp;
> @@ -3758,7 +3760,7 @@ initialise_ag_freespace(
>  
>  	libxfs_alloc_fix_freelist(&args, 0);
>  	libxfs_perag_put(args.pag);
> -	libxfs_trans_commit(tp);
> +	libxfs_trans_commit_delwri(tp, delwri_list);
>  }
>  
>  /*
> @@ -3812,6 +3814,7 @@ main(
>  	char			*protofile = NULL;
>  	char			*protostring = NULL;
>  	int			freelist_size = 0;
> +	LIST_HEAD		(delwri_list);
>  
>  	struct libxfs_xinit	xi = {
>  		.isdirect = LIBXFS_DIRECT,
> @@ -4042,9 +4045,13 @@ main(
>  	 * Initialise all the AG headers on disk.
>  	 */
>  	for (agno = 0; agno < cfg.agcount; agno++) {
> -		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size);
> -		initialise_ag_freespace(mp, agno, freelist_size);
> +		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size,
> +				      &delwri_list);
> +		initialise_ag_freespace(mp, agno, freelist_size, &delwri_list);
> +		if (agno && !(agno % 100))
> +			libxfs_buf_delwri_flush(&delwri_list);
>  	}
> +	libxfs_buf_delwri_flush(&delwri_list);
>  
>  	/*
>  	 * Allocate the root inode and anything else in the proto file.
> -- 
> 2.17.0
> 


* Re: [PATCH 4/4] mkfs: Use AIO for batched writeback
  2018-09-05  8:19 ` [PATCH 4/4] mkfs: Use AIO for batched writeback Dave Chinner
@ 2018-09-06 13:32   ` Brian Foster
  2018-09-07  0:30     ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2018-09-06 13:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 05, 2018 at 06:19:32PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> What it says. An 8EB filesystem can now be made in 15 minutes on
> storage that can do 100,000 IOPS. Before this patch mkfs topped out
> at about 13,000 IOPS.
> 
> Proof of concept proven.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  libxfs/libxfs_io.h |   1 +
>  libxfs/rdwr.c      | 170 +++++++++++++++++++++++++++++++++++++++++++--
>  mkfs/Makefile      |   2 +-
>  mkfs/xfs_mkfs.c    |   2 +-
>  4 files changed, 169 insertions(+), 6 deletions(-)
> 
> diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
> index c69cc7cd7ec5..2b11d008998d 100644
> --- a/libxfs/libxfs_io.h
> +++ b/libxfs/libxfs_io.h
> @@ -67,6 +67,7 @@ typedef struct xfs_buf {
>  	struct xfs_buf_map	__b_map;
>  	int			b_nmaps;
>  	struct list_head	b_list;
> +	int			b_io_count;
>  #ifdef XFS_BUF_TRACING
>  	struct list_head	b_lock_list;
>  	const char		*b_func;
> diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
> index 7fbaae571abe..5336f6a1e1de 100644
> --- a/libxfs/rdwr.c
> +++ b/libxfs/rdwr.c
> @@ -20,6 +20,9 @@
>  
>  #include "libxfs.h"		/* for LIBXFS_EXIT_ON_FAILURE */
>  
> +/* XXX: all the aio stuff is hacky and proof of concept only */
> +#include <libaio.h>
> +
>  /*
>   * Important design/architecture note:
>   *
> @@ -1236,6 +1239,151 @@ xfs_buf_cmp(
>  	return 0;
>  }
>  
> +/* hacky AIO stuff that'll handle a single delwri queue flush at a time */
> +#define MAX_AIO_EVENTS 1024
> +io_context_t ctxp;
> +struct iocb *iocbs[MAX_AIO_EVENTS];
> +struct io_event ioevents[MAX_AIO_EVENTS];
> +int aio_next;
> +int aio_flight;
> +
> +static void
> +init_aio_delwri(void)
> +{
> +	int i, r;
> +
> +	memset(&ctxp, 0, sizeof(ctxp));
> +	r = io_setup(MAX_AIO_EVENTS, &ctxp);
> +	if (r) {
> +		printf("FAIL! io_setup returned %d\n", r);
> +		exit(1);
> +	}
> +	for (i = 0; i < MAX_AIO_EVENTS; ++i) {
> +		iocbs[i] = calloc(1, sizeof(struct iocb));
> +		if (iocbs[i] == NULL) {
> +			printf("failed to allocate an iocb\n");
> +			exit(1);
> +		}
> +	}
> +}
> +
> +int
> +get_write_completions(
> +	int	threshold)
> +{
> +	int	error = 0;
> +	int	i, r;
> +
> +	while (aio_flight > threshold) {
> +		/* gather up some completions */
> +		r = io_getevents(ctxp, 1, MAX_AIO_EVENTS, ioevents, NULL);
> +		if (r < 0)  {
> +			printf("FAIL! io_getevents returned %d\n", r);
> +			exit(1);
> +		}
> +
> +		aio_flight -= r;
> +		for (i = 0; i < r; ++i) {
> +			struct xfs_buf *bp = ioevents[i].data;
> +			if (ioevents[i].res < 0)
> +				bp->b_error = ioevents[i].res;
> +			else if (ioevents[i].res != bp->b_bcount)
> +				bp->b_error = -EIO;
> +
> +			if (--bp->b_io_count == 0 && !bp->b_error) {
> +				bp->b_flags |= LIBXFS_B_UPTODATE;
> +				bp->b_flags &= ~(LIBXFS_B_DIRTY | LIBXFS_B_EXIT |
> +						 LIBXFS_B_UNCHECKED);

I'd think we need these flags updates on the submission side
->b_io_count decrement as well, but I guess the single-threadedness of
the whole thing dictates that I/O will only ever complete here.

Which begs the question.. why elevate ->b_io_count in the delwri submit
path at all? FWIW, I can see writing this consistently either way: take
the I/O ref and handle completion in both places, or rely on the fact
that completion occurs in one place and the flush only needs to know if
the buffer count is still elevated immediately after submission.
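
Something like a single completion helper would keep the two sides
consistent - a rough sketch (hypothetical name, not part of the posted
patch) built only from the fields and calls the patch already uses:

	/* hypothetical: drop one I/O reference, finish the buffer on the last */
	static void
	aio_buf_iodone(
		struct xfs_buf	*bp)
	{
		if (--bp->b_io_count > 0)
			return;
		if (!bp->b_error) {
			bp->b_flags |= LIBXFS_B_UPTODATE;
			bp->b_flags &= ~(LIBXFS_B_DIRTY | LIBXFS_B_EXIT |
					 LIBXFS_B_UNCHECKED);
		}
		libxfs_putbuf(bp);
	}

Both the submission path in libxfs_buf_delwri_flush() and the completion
loop here would then call it instead of open-coding the decrement.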

> +				libxfs_putbuf(bp);
> +			} else if (!error) {
> +				error = bp->b_error;
> +			}
> +		}
> +		/* wait for a short while */
> +		usleep(1000);
> +	}
> +	return error;
> +}
> +
> +static int
> +__aio_write(
> +	struct xfs_buf  *bp,
> +	void *buf,
> +	int len,
> +	off64_t offset)
> +{
> +	int	r, i;
> +	int	fd = libxfs_device_to_fd(bp->b_target->dev);
> +
> +	i = aio_next++ % MAX_AIO_EVENTS;
> +	aio_flight++;
> +	bp->b_io_count++;
> +
> +	io_prep_pwrite(iocbs[i], fd, buf, len, offset);
> +	iocbs[i]->data = bp;
> +	r = io_submit(ctxp, 1, &iocbs[i]);
> +	if (r != 1) {
> +		aio_flight--;

Probably need to decrement ->b_io_count here as well..?
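
i.e. the failure branch would presumably roll back both counters it
bumped - a sketch of what that might look like, keeping the rest of the
branch as posted:

	if (r != 1) {
		aio_flight--;
		bp->b_io_count--;	/* undo the I/O reference taken above */
		if (bp->b_flags & LIBXFS_B_EXIT)
			exit(1);
		return -EIO;
	}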

Brian

> +		if (bp->b_flags & LIBXFS_B_EXIT)
> +			exit(1);
> +		return -EIO;
> +	}
> +	return 0;
> +
> +}
> +
> +static int
> +do_write(
> +	struct xfs_buf	*bp)
> +{
> +	/*
> +	 * we never write buffers that are marked stale. This indicates they
> +	 * contain data that has been invalidated, and even if the buffer is
> +	 * dirty it must *never* be written. Verifiers are wonderful for finding
> +	 * bugs like this. Make sure the error is obvious as to the cause.
> +	 */
> +	if (bp->b_flags & LIBXFS_B_STALE) {
> +		bp->b_error = -ESTALE;
> +		return bp->b_error;
> +	}
> +
> +	/*
> +	 * clear any pre-existing error status on the buffer. This can occur if
> +	 * the buffer is corrupt on disk and the repair process doesn't clear
> +	 * the error before fixing and writing it back.
> +	 */
> +	bp->b_error = 0;
> +	if (bp->b_ops) {
> +		bp->b_ops->verify_write(bp);
> +		if (bp->b_error) {
> +			fprintf(stderr,
> +	_("%s: write verifer failed on %s bno 0x%llx/0x%x\n"),
> +				__func__, bp->b_ops->name,
> +				(long long)bp->b_bn, bp->b_bcount);
> +			return bp->b_error;
> +		}
> +	}
> +
> +	if (!(bp->b_flags & LIBXFS_B_DISCONTIG)) {
> +		bp->b_error = __aio_write(bp, bp->b_addr, bp->b_bcount,
> +				    LIBXFS_BBTOOFF64(bp->b_bn));
> +	} else {
> +		int	i;
> +		void	*buf = bp->b_addr;
> +
> +		for (i = 0; i < bp->b_nmaps; i++) {
> +			off64_t	offset = LIBXFS_BBTOOFF64(bp->b_maps[i].bm_bn);
> +			int len = BBTOB(bp->b_maps[i].bm_len);
> +
> +			bp->b_error = __aio_write(bp, buf, len, offset);
> +			if (bp->b_error)
> +				break;	/* XXX: completion wait required! */
> +			buf += len;
> +		}
> +	}
> +	return bp->b_error;
> +}
> +
>  /* Processes entire list, but only returns the first error found */
>  int
>  libxfs_buf_delwri_flush(
> @@ -1243,21 +1391,35 @@ libxfs_buf_delwri_flush(
>  {
>  	struct xfs_buf		*bp;
>  	int			 error = 0;
> +	static bool		aio_init = false;
> +	int			ret;
> +
> +	if (!aio_init) {
> +		init_aio_delwri();
> +		aio_init = true;
> +	}
>  
>  	list_sort(NULL, delwri_list, xfs_buf_cmp);
>  	while (!list_empty(delwri_list)) {
>  		bp = list_first_entry(delwri_list, struct xfs_buf, b_list);
>  		list_del_init(&bp->b_list);
>  		bp->b_flags &= ~LIBXFS_B_DELWRI_Q;
> +		bp->b_io_count = 1;
>  		if (!bp->b_error && (bp->b_flags & LIBXFS_B_DIRTY)) {
> -			int ret;
> -			ret = libxfs_writebufr(bp);
> +			ret = do_write(bp);
>  			if (ret && !error)
>  				error = ret;
>  		}
> -		libxfs_putbuf(bp);
> +		if (--bp->b_io_count == 0)
> +			libxfs_putbuf(bp);
> +
> +		/* wait for some completions if we've dispatched lots */
> +		ret = get_write_completions(MAX_AIO_EVENTS / 5 * 4);
> +		if (ret && !error)
> +			error = ret;
>  	}
> -	return error;
> +	ret = get_write_completions(0);
> +	return error ? error : ret;
>  }
>  
>  
> diff --git a/mkfs/Makefile b/mkfs/Makefile
> index 31482b08d559..e27f10f26b14 100644
> --- a/mkfs/Makefile
> +++ b/mkfs/Makefile
> @@ -11,7 +11,7 @@ HFILES =
>  CFILES = proto.c xfs_mkfs.c
>  
>  LLDLIBS += $(LIBXFS) $(LIBXCMD) $(LIBFROG) $(LIBRT) $(LIBPTHREAD) $(LIBBLKID) \
> -	$(LIBUUID)
> +	$(LIBUUID) -laio
>  LTDEPENDENCIES += $(LIBXFS) $(LIBXCMD) $(LIBFROG)
>  LLDFLAGS = -static-libtool-libs
>  
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index b751b1fcb4a3..82e041dd1b48 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -4048,7 +4048,7 @@ main(
>  		initialise_ag_headers(&cfg, mp, sbp, agno, &freelist_size,
>  				      &delwri_list);
>  		initialise_ag_freespace(mp, agno, freelist_size, &delwri_list);
> -		if (agno && !(agno % 100))
> +		if (agno && !(agno % 1000))
>  			libxfs_buf_delwri_flush(&delwri_list);
>  	}
>  	libxfs_buf_delwri_flush(&delwri_list);
> -- 
> 2.17.0
> 


* Re: [PATCH 1/4] mkfs: stop zeroing old superblocks excessively
  2018-09-06 13:31   ` Brian Foster
@ 2018-09-07  0:04     ` Dave Chinner
  2018-09-07 11:05       ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2018-09-07  0:04 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Sep 06, 2018 at 09:31:08AM -0400, Brian Foster wrote:
> On Wed, Sep 05, 2018 at 06:19:29PM +1000, Dave Chinner wrote:
....
> > @@ -1220,15 +1221,68 @@ zero_old_xfs_structures(
> >  
> >  	/*
> >  	 * block size and basic geometry seems alright, zero the secondaries.
> > +	 *
> > +	 * Don't be insane when it comes to overwriting really large filesystems
> > +	 * as it could take millions of IOs to zero every secondary
> > +	 * superblock. If we are remaking a huge filesystem, then do the
> > +	 * zeroing, but if we are replacing it with a small one (typically done
> > +	 * in test environments), limit the zeroing to:
> > +	 *
> > +	 *	- around the range of the new filesystem
> > +	 *	- the middle of the old filesystem
> > +	 *	- the end of the old filesystem
> > +	 *
> > +	 * Killing the middle and end of the old filesystem will prevent repair
> > +	 * from finding it with its fast secondary sb scan algorithm. The slow
> > +	 * scan algorithm will then confirm the small filesystem geometry by
> > +	 * brute force scans.
> >  	 */
> >  	memset(buf, 0, new_sb->sb_sectsize);
> > +
> > +	/* this carefully avoids integer overflows */
> > +	end = sb.sb_dblocks;
> > +	if (sb.sb_agcount > 10000 &&
> > +	    new_sb->sb_dblocks < end / 10)
> > +		end = new_sb->sb_dblocks * 10;
> 
> ... but what's with the 10k agcount cutoff? Just a number out of a hat
> to demonstrate the improvement..?

yeah, I pulled it from a hat, but mainly so it only triggers the new
"partial zeroing" code on really large devices that had a large
filesystem and now we're making a much smaller filesystem on it.

There's no real reason for it except for the fact I haven't done the
verification needed to make this the default behaviour (hence the
RFCRAP status :).

> Given that you've done the work to rough in an AIO buffered write
> mechanism for mkfs, have you considered whether we can find a way to
> apply that mechanism here?

It would have to be another AIO context because this loop doesn't
use xfs_bufs.
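
Roughly, that separate context would look something like the sketch
below: a hypothetical helper (untested, made-up name, all error handling
and the existing geometry checks omitted) that pushes the same
zero-filled sector writes through a small libaio ring instead of the
synchronous pwrite() loop:

	#define ZERO_AIO_DEPTH	128

	/* hypothetical: zero one old secondary sb per AG via a libaio ring */
	static void
	zero_secondaries_aio(
		int		fd,
		void		*zbuf,		/* sb_sectsize bytes of zeroes */
		int		sectsize,
		xfs_off_t	agblocks,	/* old sb_agblocks */
		xfs_agnumber_t	agcount,	/* old sb_agcount */
		int		blocklog)	/* old sb_blocklog */
	{
		io_context_t	ctx = 0;
		struct iocb	cbs[ZERO_AIO_DEPTH];
		struct iocb	*cbp[ZERO_AIO_DEPTH];
		struct io_event	evs[ZERO_AIO_DEPTH];
		xfs_off_t	off = 0;
		xfs_agnumber_t	agno = 1;

		if (io_setup(ZERO_AIO_DEPTH, &ctx))
			return;
		while (agno < agcount) {
			int	batch = 0, done = 0;

			/* build and submit one ring's worth of writes */
			while (agno < agcount && batch < ZERO_AIO_DEPTH) {
				off += agblocks;
				io_prep_pwrite(&cbs[batch], fd, zbuf, sectsize,
					       off << blocklog);
				cbp[batch] = &cbs[batch];
				batch++;
				agno++;
			}
			if (io_submit(ctx, batch, cbp) != batch)
				break;

			/* drain the whole batch before reusing the iocbs */
			while (done < batch) {
				int	r;

				r = io_getevents(ctx, 1, batch, evs, NULL);
				if (r <= 0)
					goto out_destroy;
				done += r;
			}
		}
	out_destroy:
		io_destroy(ctx);
	}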

> I'm guessing that the result of using AIO
> wouldn't be quite as impressive of not doing I/O at all, but as you've
> already noted, this algorithmic optimization is more targeted at test
> environments than production ones. The AIO approach sounds like it could
> be more broadly beneficial, even if not quite as fast in those
> particular test cases.

I just don't think it's worth the hassle as, like you said, it's
easier just to avoid the IO altogether.
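
To put rough numbers on the saving: replace an old filesystem of a
million same-sized AGs with one a hundredth of the size and end gets
capped at ten times the new sb_dblocks, so the first loop writes on the
order of a hundred thousand secondary sb sectors instead of a million,
and the middle/end trashing passes only add a couple of thousand more.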

> 
> >  	off = 0;
> > -	for (i = 1; i < sb.sb_agcount; i++)  {
> > +	for (i = 1; i < sb.sb_agcount && off < end; i++)  {
> > +		off += sb.sb_agblocks;
> > +		if (pwrite(xi->dfd, buf, new_sb->sb_sectsize,
> > +					off << sb.sb_blocklog) == -1)
> > +			break;
> > +	}
> > +
> > +	if (end == sb.sb_dblocks)
> > +		return;
> > +
> > +	/*
> > +	 * Trash the middle 1000 AGs of the old fs, which we know has at least
> > +	 * 10000 AGs at this point. Cast to make sure we are doing 64bit
> > +	 * multiplies, otherwise off gets truncated to 32 bit. I hate C.
> > +	 */
> > +	i = (sb.sb_agcount / 2) - 500;
> > +	off = (xfs_off_t)sb.sb_agblocks * i;
> > +	off = (xfs_off_t)sb.sb_agblocks * ((sb.sb_agcount / 2) - 500);
> 
> Looks like a couple lines of dead code there.

Yup, didn't clean it up properly, did I?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/4] mkfs: rework AG header initialisation ordering
  2018-09-06 13:31   ` Brian Foster
@ 2018-09-07  0:08     ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2018-09-07  0:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Sep 06, 2018 at 09:31:24AM -0400, Brian Foster wrote:
> On Wed, Sep 05, 2018 at 06:19:30PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When observing the behaviour of an 8EB mkfs execution, I noticed
> > a phase where a massive number of read/modify/write
> > cycles occur. I didn't wait for it to complete - it was obvious
> > that it was after all the AG headers had been written. That left the
> > AGFL initialisation as the likely cause.
> > 
> > When all the AG headers don't fit in the libxfs buffer cache, the
> > AGFL init requires re-reading the AGF, the AGFL, the free space tree
> > root blocks and the rmap tree root block. They all then get
> > modified and written back out. 10 IOs per AG. When you have 8
> > million AGs, that's a lot of extra IO.
> > 
> > Change the initialisation algorithm to initialise the AGFL
> > immediately after initialising the rest of the headers and
> > calculating the minimum AGFL size for that AG. This means the
> > modifications will all hit the buffer cache and this will remove the
> > IO penalty.
> > 
> > The "worst_freelist" size calculation doesn't change from AG to AG -
> > it's based on the physical configuration of the AG, and all AGs have
> > the same configuration. Hence we only need to calculate this once,
> > not for every AG. That allows us to initialise the AGFL immediately
> > after the rest of the AG has been initialised rather than in a
> > separate pass.
> > 
> > Time to make a filesystem from scratch, using a zeroed device so the
> > force overwrite algorithms are not triggered and -K to avoid
> > discards:
> > 
> > FS size		10PB	100PB	 1EB
> > current mkfs	26.9s	214.8s	2484s
> > patched		11.3s	 70.3s	 709s
> > 
> > In both cases, the IO profile looks identical for the initial AG
> > header writeout loop. The difference is that the old code then
> > does the RMW loop to init the AGFL, and that runs at about half the
> > speed. Hence runtime of the new code is reduced by around 65-70%
> > simply by avoiding all that IO.
> > 
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> This one seems like it stands alone as a nice fixup. Were you planning
> to send this independently as a non-rfc patch?

Eventually. This hasn't gone through xfstests yet, so it may yet let
the smoke out....

> >  mkfs/xfs_mkfs.c | 40 +++++++++++++++++++++++++---------------
> >  1 file changed, 25 insertions(+), 15 deletions(-)
> > 
> > diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> > index c153592c705e..d70fbdb6b15a 100644
> > --- a/mkfs/xfs_mkfs.c
> > +++ b/mkfs/xfs_mkfs.c
> > @@ -3374,7 +3374,7 @@ initialise_ag_headers(
> >  	struct xfs_mount	*mp,
> >  	struct xfs_sb		*sbp,
> >  	xfs_agnumber_t		agno,
> > -	int			*worst_freelist)
> > +	int			*freelist_size)
> >  {
> >  	struct xfs_perag	*pag = libxfs_perag_get(mp, agno);
> >  	struct xfs_agfl		*agfl;
> > @@ -3453,8 +3453,22 @@ initialise_ag_headers(
> >  		agf->agf_longest = cpu_to_be32(agsize -
> >  			XFS_FSB_TO_AGBNO(mp, cfg->logstart) - cfg->logblocks);
> >  	}
> > -	if (libxfs_alloc_min_freelist(mp, pag) > *worst_freelist)
> > -		*worst_freelist = libxfs_alloc_min_freelist(mp, pag);
> > +
> > +	/*
> > +	 * The AGFL size is the same for all AGs because all AGs have the same
> > +	 * layout. If this AG sameness ever changes in the future, we'll need to
> > +	 * revisit how we initialise the AGFLs.
> > +	 */
> 
> This is not necessarily the case if the last AG is not full size, right?
> I think the comment could point that out (and/or that this works so long
> as we don't process the last AG first).

Right. I can add that to the comment.

> BTW, libxfs_alloc_min_freelist() uses the ->pagf_levels values for the
> bno, cnt and rmap btrees to establish the freelist size, and I don't see
> where we've assigned ->pagf_levels[XFS_BTNUM_RMAPi] anywhere.

Well spotted! I'll write another patch for that.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 3/4] mkfs: introduce new delayed write buffer list
  2018-09-06 13:32   ` Brian Foster
@ 2018-09-07  0:21     ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2018-09-07  0:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Sep 06, 2018 at 09:32:15AM -0400, Brian Foster wrote:
> On Wed, Sep 05, 2018 at 06:19:31PM +1000, Dave Chinner wrote:
> > diff --git a/libxfs/trans.c b/libxfs/trans.c
> > index 2bb0d3b8e2d1..c3da46479efa 100644
> > --- a/libxfs/trans.c
> > +++ b/libxfs/trans.c
> > @@ -728,10 +728,11 @@ inode_item_done(
> >  
> >  static void
> >  buf_item_done(
> > -	xfs_buf_log_item_t	*bip)
> > +	struct xfs_buf_log_item	*bip,
> > +	struct list_head	*delwri_list)
> >  {
> > -	xfs_buf_t		*bp;
> > -	int			hold;
> > +	struct xfs_buf		*bp;
> > +	bool			hold;
> >  	extern kmem_zone_t	*xfs_buf_item_zone;
> >  
> >  	bp = bip->bli_buf;
> > @@ -745,7 +746,13 @@ buf_item_done(
> >  		fprintf(stderr, "flushing/staling buffer %p (hold=%d)\n",
> >  			bp, hold);
> >  #endif
> > -		libxfs_writebuf_int(bp, 0);
> > +		if (delwri_list) {
> > +			/* delwri list needs to hold on to the buffer here */
> > +			libxfs_buf_delwri_add(bp, 0, delwri_list);
> > +			hold = true;
> 
> This seems a bit flakey.

Yup, it's a nasty hack to avoid having to put proper reference
counting into the userspace xfs_bufs.

> IIUC, the hold is set here because the delwri
> queue either needs the reference until after I/O completion (or it
> dropped the caller's reference already if the buffer were already present
> on the queue). If BLI_HOLD is set in this case, however, haven't we
> basically stolen the caller's reference?

Yes, that's precisely why it's a nasty hack.

> I'm guessing this probably doesn't ever happen in the limited scope of
> mkfs, so consider that an interface design nit for now.

Right, that's how I got away with it here. :P

> I suppose a more
> robust mechanism might more closely resemble the kernel approach where
> the delwri_queue() acquires its own reference on the buf (somehow or
> another as applied to the xfsprogs buffer management system, I don't
> have it all paged in atm).

The right approach is to port the kernel buffer cache implementation
to libxfs and implement bio_submit() and the bio completion
callbacks via an AIO engine. Then we can add an AIL and convert all
the open coded libxfs_write() calls in this mkfs code to transaction
joins as ordered buffers. That way we don't need the delwri list
hack into xfs_trans_commit() - we just push the AIL every so
often...

That's a lot more work than this proof of concept, though.
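
For the interim delwri list itself, the kernel-style fix you describe
would roughly be for the queue to take its own hold instead of stealing
the caller's - something like the sketch below, where b_hold is a
hypothetical reference count that the current userspace xfs_buf doesn't
have:

	/* hypothetical: queue a dirty buffer without consuming the caller's ref */
	static void
	libxfs_buf_delwri_queue(
		struct xfs_buf		*bp,
		struct list_head	*delwri_list)
	{
		if (bp->b_flags & LIBXFS_B_DELWRI_Q)
			return;				/* already queued */
		bp->b_flags |= LIBXFS_B_DELWRI_Q;
		bp->b_hold++;				/* the queue's own reference */
		list_add_tail(&bp->b_list, delwri_list);
	}

The flush side would then drop that reference with libxfs_putbuf() once
the I/O completes, much like the AIO completion loop in patch 4 already
does, and buf_item_done() wouldn't need to fiddle the hold flag at all.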

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/4] mkfs: Use AIO for batched writeback
  2018-09-06 13:32   ` Brian Foster
@ 2018-09-07  0:30     ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2018-09-07  0:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Sep 06, 2018 at 09:32:24AM -0400, Brian Foster wrote:
> On Wed, Sep 05, 2018 at 06:19:32PM +1000, Dave Chinner wrote:
> > +int
> > +get_write_completions(
> > +	int	threshold)
> > +{
> > +	int	error = 0;
> > +	int	i, r;
> > +
> > +	while (aio_flight > threshold) {
> > +		/* gather up some completions */
> > +		r = io_getevents(ctxp, 1, MAX_AIO_EVENTS, ioevents, NULL);
> > +		if (r < 0)  {
> > +			printf("FAIL! io_getevents returned %d\n", r);
> > +			exit(1);
> > +		}
> > +
> > +		aio_flight -= r;
> > +		for (i = 0; i < r; ++i) {
> > +			struct xfs_buf *bp = ioevents[i].data;
> > +			if (ioevents[i].res < 0)
> > +				bp->b_error = ioevents[i].res;
> > +			else if (ioevents[i].res != bp->b_bcount)
> > +				bp->b_error = -EIO;
> > +
> > +			if (--bp->b_io_count == 0 && !bp->b_error) {
> > +				bp->b_flags |= LIBXFS_B_UPTODATE;
> > +				bp->b_flags &= ~(LIBXFS_B_DIRTY | LIBXFS_B_EXIT |
> > +						 LIBXFS_B_UNCHECKED);
> 
> I'd think we need these flags updates on the submission side
> ->b_io_count decrement as well, but I guess the single-threadedness of
> the whole thing dictates that I/O will only ever complete here.

Yup - I made lots of shortcuts and hacks because it is a tight
single threaded AIO dispatch/complete loop.

> 
> Which begs the question.. why elevate ->b_io_count in the delwri submit
> path at all?

So we don't drop the reference to the buffer before we've completed
all the individual IOs on a discontiguous buffer.

> FWIW, I can see writing this consistently either way: take
> the I/O ref and handle completion in both places, or rely on the fact
> that completion occurs in one place and the flush only needs to know if
> the buffer count is still elevated immediately after submission.

Sure, but for a quick and dirty AIO ring implementation like this
that's not needed. Of course, none of this code would make it into a
proper AIO engine implementation, so don't get too hung up on all
the shortcuts and simplifications I've taken in this RFCRAP code
to prove the concept is worth pursuing....

> > +				libxfs_putbuf(bp);
> > +			} else if (!error) {
> > +				error = bp->b_error;
> > +			}
> > +		}
> > +		/* wait for a short while */
> > +		usleep(1000);

Like this :P

> > +	}
> > +	return error;
> > +}
> > +
> > +static int
> > +__aio_write(
> > +	struct xfs_buf  *bp,
> > +	void *buf,
> > +	int len,
> > +	off64_t offset)
> > +{
> > +	int	r, i;
> > +	int	fd = libxfs_device_to_fd(bp->b_target->dev);
> > +
> > +	i = aio_next++ % MAX_AIO_EVENTS;
> > +	aio_flight++;
> > +	bp->b_io_count++;
> > +
> > +	io_prep_pwrite(iocbs[i], fd, buf, len, offset);
> > +	iocbs[i]->data = bp;
> > +	r = io_submit(ctxp, 1, &iocbs[i]);
> > +	if (r != 1) {
> > +		aio_flight--;
> 
> Probably need to decrement ->b_io_count here as well..?

Yup, that'd be a bug.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/4] mkfs: stop zeroing old superblocks excessively
  2018-09-07  0:04     ` Dave Chinner
@ 2018-09-07 11:05       ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2018-09-07 11:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Sep 07, 2018 at 10:04:32AM +1000, Dave Chinner wrote:
> On Thu, Sep 06, 2018 at 09:31:08AM -0400, Brian Foster wrote:
> > On Wed, Sep 05, 2018 at 06:19:29PM +1000, Dave Chinner wrote:
> ....
> > > @@ -1220,15 +1221,68 @@ zero_old_xfs_structures(
> > >  
> > >  	/*
> > >  	 * block size and basic geometry seems alright, zero the secondaries.
> > > +	 *
> > > +	 * Don't be insane when it comes to overwriting really large filesystems
> > > +	 * as it could take millions of IOs to zero every secondary
> > > +	 * superblock. If we are remaking a huge filesystem, then do the
> > > +	 * zeroing, but if we are replacing it with a small one (typically done
> > > +	 * in test environments), limit the zeroing to:
> > > +	 *
> > > +	 *	- around the range of the new filesystem
> > > +	 *	- the middle of the old filesystem
> > > +	 *	- the end of the old filesystem
> > > +	 *
> > > +	 * Killing the middle and end of the old filesystem will prevent repair
> > > +	 * from finding it with its fast secondary sb scan algorithm. The slow
> > > +	 * scan algorithm will then confirm the small filesystem geometry by
> > > +	 * brute force scans.
> > >  	 */
> > >  	memset(buf, 0, new_sb->sb_sectsize);
> > > +
> > > +	/* this carefully avoids integer overflows */
> > > +	end = sb.sb_dblocks;
> > > +	if (sb.sb_agcount > 10000 &&
> > > +	    new_sb->sb_dblocks < end / 10)
> > > +		end = new_sb->sb_dblocks * 10;
> > 
> > ... but what's with the 10k agcount cutoff? Just a number out of a hat
> > to demonstrate the improvement..?
> 
> yeah, I pulled it from a hat, but mainly so it only triggers the new
> "partial zeroing" code on really large devices that had a large
> filesystem and now we're making a much smaller filesystem on it.
> 
> There's no real reason for it except for the fact I haven't done the
> verification needed to make this the default behaviour (hence the
> RFCRAP status :).
> 

Ok...

> > Given that you've done the work to rough in an AIO buffered write
> > mechanism for mkfs, have you considered whether we can find a way to
> > apply that mechanism here?
> 
> It would have to be another AIO context because this loop doesn't
> use xfs_bufs.
> 

Yup. I wasn't necessarily suggesting we use xfs_bufs, but I suppose that
could be an option if we could find a clever way to do so. It's probably
not worth getting into the weeds of an implementation until the core
I/O infrastructure is more fleshed out.

> > I'm guessing that the result of using AIO
> > wouldn't be quite as impressive as not doing I/O at all, but as you've
> > already noted, this algorithmic optimization is more targeted at test
> > environments than production ones. The AIO approach sounds like it could
> > be more broadly beneficial, even if not quite as fast in those
> > particular test cases.
> 
> I just don't think it's worth the hassle as, like you said, it's
> easier just to avoid the IO altogether.
> 

Well, I said it probably wouldn't be as fast as not doing the I/O at
all. I'm not necessarily convinced it's easier (or better). ;P This
patch is kind of a case in point. It's more code than the current
dead-simple pwrite() loop, it picks an arbitrary cutoff which suggests
it needs more work (and/or complexity) to be more generically
acceptable, and it ultimately requires some kind of evaluation against
(or cooperation with) the repair secondary scan to ensure it's done
properly. If this function could do exactly what it does now, but much
faster (based on the mkfs results you've achieved), then that could be a
simpler approach.

All I'm really saying is that if we're going through the work to create
a fast/efficient I/O mechanism for mkfs, it's worth looking into
applying that mechanism here because it's the same fundamental problem
(applied to the narrower case of zeroing out that old, huge fs rather
than creating it in the first place). If the code complexity is
comparable (an open/valid question), then at least we stand to benefit
regardless of whether the old -> new fs is an 8x delta rather than 10x
for example, without having to come up with (and maintain) new/rigid
heuristics.

If the benefits aren't as great or there's just too much of an impedance
mismatch between this algorithm and the underlying I/O mechanism to
justify the complexity, then we can always fall back to doing something
like this. I'd just prefer to see some of that analysis before committing
to this approach.

Brian

> > 
> > >  	off = 0;
> > > -	for (i = 1; i < sb.sb_agcount; i++)  {
> > > +	for (i = 1; i < sb.sb_agcount && off < end; i++)  {
> > > +		off += sb.sb_agblocks;
> > > +		if (pwrite(xi->dfd, buf, new_sb->sb_sectsize,
> > > +					off << sb.sb_blocklog) == -1)
> > > +			break;
> > > +	}
> > > +
> > > +	if (end == sb.sb_dblocks)
> > > +		return;
> > > +
> > > +	/*
> > > +	 * Trash the middle 1000 AGs of the old fs, which we know has at least
> > > +	 * 10000 AGs at this point. Cast to make sure we are doing 64bit
> > > +	 * multiplies, otherwise off gets truncated to 32 bit. I hate C.
> > > +	 */
> > > +	i = (sb.sb_agcount / 2) - 500;
> > > +	off = (xfs_off_t)sb.sb_agblocks * i;
> > > +	off = (xfs_off_t)sb.sb_agblocks * ((sb.sb_agcount / 2) - 500);
> > 
> > Looks like a couple lines of dead code there.
> 
> Yup, didn't clean it up properly, did I?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


Thread overview: 14+ messages
2018-09-05  8:19 [RFCRAP PATCH 0/4 v2] mkfs.xfs IO scalability Dave Chinner
2018-09-05  8:19 ` [PATCH 1/4] mkfs: stop zeroing old superblocks excessively Dave Chinner
2018-09-06 13:31   ` Brian Foster
2018-09-07  0:04     ` Dave Chinner
2018-09-07 11:05       ` Brian Foster
2018-09-05  8:19 ` [PATCH 2/4] mkfs: rework AG header initialisation ordering Dave Chinner
2018-09-06 13:31   ` Brian Foster
2018-09-07  0:08     ` Dave Chinner
2018-09-05  8:19 ` [PATCH 3/4] mkfs: introduce new delayed write buffer list Dave Chinner
2018-09-06 13:32   ` Brian Foster
2018-09-07  0:21     ` Dave Chinner
2018-09-05  8:19 ` [PATCH 4/4] mkfs: Use AIO for batched writeback Dave Chinner
2018-09-06 13:32   ` Brian Foster
2018-09-07  0:30     ` Dave Chinner
