* [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
@ 2017-10-26  8:33 Dave Chinner
  2017-10-26  8:33 ` [PATCH 01/14] xfs: factor out AG header initialisation from growfs core Dave Chinner
                   ` (15 more replies)
  0 siblings, 16 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

This patchset is aimed at filesystems that are installed on sparse
block devices, a.k.a thin provisioned devices. The aim of the
patchset is to bring the space management aspect of the storage
stack up into the filesystem rather than keeping it below the
filesystem where users and the filesystem have no clue they are
about to run out of space.

The idea is that thin block devices will be massively
over-provisioned giving the filesystem a large block device address
space to manage, but the filesystem presents itself as a much
smaller filesystem. That is, the space the filesystem presents to
users is much smaller than the address space the block device
provides.

This somewhat turns traditional thin provisioning on its head.
Admins are used to lying through their teeth to users about how much
space they have available, and then they hope to hell that users
never try to store as much data as they've been "provisioned" with.
As a result, the traditional failure case is the block device
running out of space all of a sudden and the filesystem and
users wondering WTF just went wrong with their system.

Moving the space management up into the filesystem by itself doesn't
solve this problem - the thin storage pools can still be
over-committed - but it does allow a new way of managing the space.
Essentially, growing or shrinking a thin filesystem is an
operation that only takes a couple of milliseconds to do because
it's just an accounting trick. It's far less complex than creating
a new file, or even reading data from a file.
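
To make the "accounting trick" concrete, here's a rough sketch of what
a thin resize boils down to. This is purely illustrative - the struct
and function names are invented and are not what this series uses:

#include <errno.h>
#include <stdint.h>

/*
 * Sketch only. The usable capacity is just a counter checked at
 * allocation time; it is independent of the (much larger) device
 * address space the filesystem already manages, so resizing never
 * has to move data or metadata around.
 */
struct thin_fs_space {
        uint64_t        dev_blocks;     /* block device address space */
        uint64_t        usable_blocks;  /* capacity presented to users */
        uint64_t        used_blocks;    /* blocks currently allocated */
};

static int
thin_fs_resize(struct thin_fs_space *sp, uint64_t new_usable)
{
        if (new_usable > sp->dev_blocks)
                return -EINVAL;         /* beyond the device address space */
        if (new_usable < sp->used_blocks)
                return -ENOSPC;         /* can't shrink below allocated space */
        sp->usable_blocks = new_usable; /* the entire "grow" or "shrink" */
        return 0;
}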

Freeing unused space from the filesystem isn't done during a shrink
operation. It is done through discard operations, either dynamically
via the discard mount option or, preferably, by an fstrim
invocation. This means freeing space in the thin pool is in no way
tied to the management of the filesystem size and space enforcement,
even during a grow or shrink operation.
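
As an aside, for anyone who hasn't looked at how fstrim works: it is
just a wrapper around the FITRIM ioctl. A minimal, illustrative
version of what it does to hand freed blocks back to the thin pool
(error handling stripped, function name invented) looks like this:

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Discard all free space in the filesystem mounted at @mountpoint. */
int trim_fs(const char *mountpoint)
{
        struct fstrim_range range = {
                .start  = 0,
                .len    = ULLONG_MAX,   /* whole filesystem */
                .minlen = 0,            /* no minimum extent length */
        };
        int fd, ret;

        fd = open(mountpoint, O_RDONLY | O_DIRECTORY);
        if (fd < 0)
                return -1;
        ret = ioctl(fd, FITRIM, &range);
        if (ret == 0)
                printf("trimmed %llu bytes\n", (unsigned long long)range.len);
        close(fd);
        return ret;
}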

What it means is that the filesystem controls the amount of active
data the user can have in the thin pool. The thin pool usage may be
more or less, depending on snapshots, deduplication,
freed-but-not-discarded space, etc. And because of how low the
overhead of changing the accounting is, users don't need to be given
a filesystem with all the space they might need once in a blue moon.
It is trivial to expand when needed, and to shrink and release the
space when the data is removed.

Yes, the underlying thin device that the filesystem sits on gets
provisioned at the "once in a blue moon" size that is requested,
but until that space is needed the filesystem can run at low amounts
of reported free space and so reduce the likelihood of sudden
thin device pool depletion.

Normally, running a filesystem for long periods of time at low
amounts of free space is a bad thing. However, for a thin
filesystem, a low amount of usable free space doesn't mean the
filesystem is running near full. The filesystem still has the full
block device address space to work with, so it has oodles of
contiguous free space hidden from the user. Hence it's not until the
thin filesystem
grows to be near "non-thin" and is near full that the traditional
"running near ENOSPC" problems arise.

How do we stop that from ever happening? For example, someone needs
100GB of space now, but maybe much more than that in a year. So
provision a 10TB thin block device and put a 100GB thin filesystem on
it. Problems won't arise until it's been grown to 100x its original
size.

Yeah, it all requires thinking about the way storage is provisioned
and managed a little bit differently, but the key point to realise
is that grow and shrink effectively become free operations on
thin devices if the filesystem is aware that it's on a thin device.

The patchset has several parts to it. It is built on a 4.14-rc5
kernel with for-next and Darrick's scrub tree from a couple of days
ago merged into it.

The first part of the series is a growfs refactoring. This can
probably stand alone, and the idea is to move the refactored
infrastructure into libxfs so it can be shared with mkfs. This also
cleans up a lot of the cruft in growfs and so makes it much easier
to add the changes later in the series.

The second part of the patchset moves the functionality of
sb_dblocks into the struct xfs_mount. This provides the separation
of address space checks and capacity-related calculations that the
thinspace mods require. This also fixes the problem of freshly made,
empty filesystems reporting 2% of the space as used.
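
To illustrate the separation (hypothetical sketch only - the field and
helper names below are invented, not what the patches actually add),
"is this block address valid?" and "how much space can users consume?"
become two different questions:

#include <stdbool.h>
#include <stdint.h>

struct xfs_mount_sketch {
        uint64_t        m_dev_blocks;    /* device address space (sb_dblocks) */
        uint64_t        m_usable_blocks; /* thin capacity reported to users */
};

/* Address space verification is bounded by the block device... */
static bool sketch_verify_fsbno(struct xfs_mount_sketch *mp, uint64_t fsbno)
{
        return fsbno < mp->m_dev_blocks;
}

/* ...while statfs-style capacity reporting uses the thin size. */
static uint64_t sketch_statfs_blocks(struct xfs_mount_sketch *mp)
{
        return mp->m_usable_blocks;
}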

The XFS_IOC_FSGEOMETRY ioctl needed to be bumped to a new version
because the structure needed growing.
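
For anyone who hasn't used it, XFS_IOC_FSGEOMETRY is how userspace
queries the filesystem geometry. A minimal query looks something like
the sketch below (illustrative only; the new fields this series adds
to the structure aren't shown here):

#include <stdio.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>    /* XFS_IOC_FSGEOMETRY, struct xfs_fsop_geom */

/* Print basic geometry for the XFS filesystem containing @fd. */
int print_geometry(int fd)
{
        struct xfs_fsop_geom geo;

        if (ioctl(fd, XFS_IOC_FSGEOMETRY, &geo) < 0)
                return -1;
        printf("blocksize %u, agcount %u, agblocks %u, datablocks %llu\n",
               geo.blocksize, geo.agcount, geo.agblocks,
               (unsigned long long)geo.datablocks);
        return 0;
}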

Finally, there's the patches that provide thinspace support and the
growfs mods needed to grow and shrink.

I've smoke tested the non-thinspace code paths (running auto tests
on a scrub enabled kernel+userspace right now) as I haven't updated
the userspace code to exercise the thinp code paths yet. I know the
concept works, but my userspace code has an older on-disk format
from the prototype so it will take me a couple of days to update and
work out how to get fstests to integrate it reliably. So this is
mainly a heads-up RFC patchset....

Comments, thoughts, flames all welcome....

Cheers,

Dave.


* [PATCH 01/14] xfs: factor out AG header initialisation from growfs core
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26  8:33 ` [PATCH 02/14] xfs: convert growfs AG header init to use buffer lists Dave Chinner
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The initialisation of new AG headers is mostly common between the
userspace mkfs code and growfs in the kernel, so start factoring it
out so we can move it to libxfs and use it in both places.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_fsops.c | 637 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 331 insertions(+), 306 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 8f22fc579dbb..487a9dca1170 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -148,20 +148,344 @@ xfs_growfs_get_hdr_buf(
 	return bp;
 }
 
+/*
+ * Write new AG headers to disk. Non-transactional, but written
+ * synchronously so they are completed prior to the growfs transaction
+ * being logged.
+ */
+static int
+xfs_grow_ag_headers(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		agsize,
+	xfs_rfsblock_t		*nfree)
+{
+	struct xfs_agf		*agf;
+	struct xfs_agi		*agi;
+	struct xfs_agfl		*agfl;
+	__be32			*agfl_bno;
+	xfs_alloc_rec_t		*arec;
+	struct xfs_buf		*bp;
+	int			bucket;
+	xfs_extlen_t		tmpsize;
+	int			error = 0;
+
+	/*
+	 * AG freespace header block
+	 */
+	bp = xfs_growfs_get_hdr_buf(mp,
+			XFS_AG_DADDR(mp, agno, XFS_AGF_DADDR(mp)),
+			XFS_FSS_TO_BB(mp, 1), 0,
+			&xfs_agf_buf_ops);
+	if (!bp) {
+		error = -ENOMEM;
+		goto out_error;
+	}
+
+	agf = XFS_BUF_TO_AGF(bp);
+	agf->agf_magicnum = cpu_to_be32(XFS_AGF_MAGIC);
+	agf->agf_versionnum = cpu_to_be32(XFS_AGF_VERSION);
+	agf->agf_seqno = cpu_to_be32(agno);
+	agf->agf_length = cpu_to_be32(agsize);
+	agf->agf_roots[XFS_BTNUM_BNOi] = cpu_to_be32(XFS_BNO_BLOCK(mp));
+	agf->agf_roots[XFS_BTNUM_CNTi] = cpu_to_be32(XFS_CNT_BLOCK(mp));
+	agf->agf_levels[XFS_BTNUM_BNOi] = cpu_to_be32(1);
+	agf->agf_levels[XFS_BTNUM_CNTi] = cpu_to_be32(1);
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		agf->agf_roots[XFS_BTNUM_RMAPi] =
+					cpu_to_be32(XFS_RMAP_BLOCK(mp));
+		agf->agf_levels[XFS_BTNUM_RMAPi] = cpu_to_be32(1);
+		agf->agf_rmap_blocks = cpu_to_be32(1);
+	}
+
+	agf->agf_flfirst = cpu_to_be32(1);
+	agf->agf_fllast = 0;
+	agf->agf_flcount = 0;
+	tmpsize = agsize - mp->m_ag_prealloc_blocks;
+	agf->agf_freeblks = cpu_to_be32(tmpsize);
+	agf->agf_longest = cpu_to_be32(tmpsize);
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
+	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+		agf->agf_refcount_root = cpu_to_be32(
+				xfs_refc_block(mp));
+		agf->agf_refcount_level = cpu_to_be32(1);
+		agf->agf_refcount_blocks = cpu_to_be32(1);
+	}
+
+	error = xfs_bwrite(bp);
+	xfs_buf_relse(bp);
+	if (error)
+		goto out_error;
+
+	/*
+	 * AG freelist header block
+	 */
+	bp = xfs_growfs_get_hdr_buf(mp,
+			XFS_AG_DADDR(mp, agno, XFS_AGFL_DADDR(mp)),
+			XFS_FSS_TO_BB(mp, 1), 0,
+			&xfs_agfl_buf_ops);
+	if (!bp) {
+		error = -ENOMEM;
+		goto out_error;
+	}
+
+	agfl = XFS_BUF_TO_AGFL(bp);
+	if (xfs_sb_version_hascrc(&mp->m_sb)) {
+		agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
+		agfl->agfl_seqno = cpu_to_be32(agno);
+		uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
+	}
+
+	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, bp);
+	for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++)
+		agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);
+
+	error = xfs_bwrite(bp);
+	xfs_buf_relse(bp);
+	if (error)
+		goto out_error;
+
+	/*
+	 * AG inode header block
+	 */
+	bp = xfs_growfs_get_hdr_buf(mp,
+			XFS_AG_DADDR(mp, agno, XFS_AGI_DADDR(mp)),
+			XFS_FSS_TO_BB(mp, 1), 0,
+			&xfs_agi_buf_ops);
+	if (!bp) {
+		error = -ENOMEM;
+		goto out_error;
+	}
+
+	agi = XFS_BUF_TO_AGI(bp);
+	agi->agi_magicnum = cpu_to_be32(XFS_AGI_MAGIC);
+	agi->agi_versionnum = cpu_to_be32(XFS_AGI_VERSION);
+	agi->agi_seqno = cpu_to_be32(agno);
+	agi->agi_length = cpu_to_be32(agsize);
+	agi->agi_count = 0;
+	agi->agi_root = cpu_to_be32(XFS_IBT_BLOCK(mp));
+	agi->agi_level = cpu_to_be32(1);
+	agi->agi_freecount = 0;
+	agi->agi_newino = cpu_to_be32(NULLAGINO);
+	agi->agi_dirino = cpu_to_be32(NULLAGINO);
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid);
+	if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+		agi->agi_free_root = cpu_to_be32(XFS_FIBT_BLOCK(mp));
+		agi->agi_free_level = cpu_to_be32(1);
+	}
+	for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++)
+		agi->agi_unlinked[bucket] = cpu_to_be32(NULLAGINO);
+
+	error = xfs_bwrite(bp);
+	xfs_buf_relse(bp);
+	if (error)
+		goto out_error;
+
+	/*
+	 * BNO btree root block
+	 */
+	bp = xfs_growfs_get_hdr_buf(mp,
+			XFS_AGB_TO_DADDR(mp, agno, XFS_BNO_BLOCK(mp)),
+			BTOBB(mp->m_sb.sb_blocksize), 0,
+			&xfs_allocbt_buf_ops);
+
+	if (!bp) {
+		error = -ENOMEM;
+		goto out_error;
+	}
+
+	xfs_btree_init_block(mp, bp, XFS_BTNUM_BNO, 0, 1, agno, 0);
+
+	arec = XFS_ALLOC_REC_ADDR(mp, XFS_BUF_TO_BLOCK(bp), 1);
+	arec->ar_startblock = cpu_to_be32(mp->m_ag_prealloc_blocks);
+	arec->ar_blockcount = cpu_to_be32(
+		agsize - be32_to_cpu(arec->ar_startblock));
+
+	error = xfs_bwrite(bp);
+	xfs_buf_relse(bp);
+	if (error)
+		goto out_error;
+
+	/*
+	 * CNT btree root block
+	 */
+	bp = xfs_growfs_get_hdr_buf(mp,
+			XFS_AGB_TO_DADDR(mp, agno, XFS_CNT_BLOCK(mp)),
+			BTOBB(mp->m_sb.sb_blocksize), 0,
+			&xfs_allocbt_buf_ops);
+	if (!bp) {
+		error = -ENOMEM;
+		goto out_error;
+	}
+
+	xfs_btree_init_block(mp, bp, XFS_BTNUM_CNT, 0, 1, agno, 0);
+
+	arec = XFS_ALLOC_REC_ADDR(mp, XFS_BUF_TO_BLOCK(bp), 1);
+	arec->ar_startblock = cpu_to_be32(mp->m_ag_prealloc_blocks);
+	arec->ar_blockcount = cpu_to_be32(
+		agsize - be32_to_cpu(arec->ar_startblock));
+	*nfree += be32_to_cpu(arec->ar_blockcount);
+
+	error = xfs_bwrite(bp);
+	xfs_buf_relse(bp);
+	if (error)
+		goto out_error;
+
+	/* RMAP btree root block */
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		struct xfs_rmap_rec	*rrec;
+		struct xfs_btree_block	*block;
+
+		bp = xfs_growfs_get_hdr_buf(mp,
+			XFS_AGB_TO_DADDR(mp, agno, XFS_RMAP_BLOCK(mp)),
+			BTOBB(mp->m_sb.sb_blocksize), 0,
+			&xfs_rmapbt_buf_ops);
+		if (!bp) {
+			error = -ENOMEM;
+			goto out_error;
+		}
+
+		xfs_btree_init_block(mp, bp, XFS_BTNUM_RMAP, 0, 0,
+					agno, 0);
+		block = XFS_BUF_TO_BLOCK(bp);
+
+
+		/*
+		 * mark the AG header regions as static metadata The BNO
+		 * btree block is the first block after the headers, so
+		 * it's location defines the size of region the static
+		 * metadata consumes.
+		 *
+		 * Note: unlike mkfs, we never have to account for log
+		 * space when growing the data regions
+		 */
+		rrec = XFS_RMAP_REC_ADDR(block, 1);
+		rrec->rm_startblock = 0;
+		rrec->rm_blockcount = cpu_to_be32(XFS_BNO_BLOCK(mp));
+		rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_FS);
+		rrec->rm_offset = 0;
+		be16_add_cpu(&block->bb_numrecs, 1);
+
+		/* account freespace btree root blocks */
+		rrec = XFS_RMAP_REC_ADDR(block, 2);
+		rrec->rm_startblock = cpu_to_be32(XFS_BNO_BLOCK(mp));
+		rrec->rm_blockcount = cpu_to_be32(2);
+		rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_AG);
+		rrec->rm_offset = 0;
+		be16_add_cpu(&block->bb_numrecs, 1);
+
+		/* account inode btree root blocks */
+		rrec = XFS_RMAP_REC_ADDR(block, 3);
+		rrec->rm_startblock = cpu_to_be32(XFS_IBT_BLOCK(mp));
+		rrec->rm_blockcount = cpu_to_be32(XFS_RMAP_BLOCK(mp) -
+						XFS_IBT_BLOCK(mp));
+		rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_INOBT);
+		rrec->rm_offset = 0;
+		be16_add_cpu(&block->bb_numrecs, 1);
+
+		/* account for rmap btree root */
+		rrec = XFS_RMAP_REC_ADDR(block, 4);
+		rrec->rm_startblock = cpu_to_be32(XFS_RMAP_BLOCK(mp));
+		rrec->rm_blockcount = cpu_to_be32(1);
+		rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_AG);
+		rrec->rm_offset = 0;
+		be16_add_cpu(&block->bb_numrecs, 1);
+
+		/* account for refc btree root */
+		if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+			rrec = XFS_RMAP_REC_ADDR(block, 5);
+			rrec->rm_startblock = cpu_to_be32(xfs_refc_block(mp));
+			rrec->rm_blockcount = cpu_to_be32(1);
+			rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_REFC);
+			rrec->rm_offset = 0;
+			be16_add_cpu(&block->bb_numrecs, 1);
+		}
+
+		error = xfs_bwrite(bp);
+		xfs_buf_relse(bp);
+		if (error)
+			goto out_error;
+	}
+
+	/*
+	 * INO btree root block
+	 */
+	bp = xfs_growfs_get_hdr_buf(mp,
+			XFS_AGB_TO_DADDR(mp, agno, XFS_IBT_BLOCK(mp)),
+			BTOBB(mp->m_sb.sb_blocksize), 0,
+			&xfs_inobt_buf_ops);
+	if (!bp) {
+		error = -ENOMEM;
+		goto out_error;
+	}
+
+	xfs_btree_init_block(mp, bp, XFS_BTNUM_INO , 0, 0, agno, 0);
+
+	error = xfs_bwrite(bp);
+	xfs_buf_relse(bp);
+	if (error)
+		goto out_error;
+
+	/*
+	 * FINO btree root block
+	 */
+	if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
+		bp = xfs_growfs_get_hdr_buf(mp,
+			XFS_AGB_TO_DADDR(mp, agno, XFS_FIBT_BLOCK(mp)),
+			BTOBB(mp->m_sb.sb_blocksize), 0,
+			&xfs_inobt_buf_ops);
+		if (!bp) {
+			error = -ENOMEM;
+			goto out_error;
+		}
+
+		xfs_btree_init_block(mp, bp, XFS_BTNUM_FINO,
+					     0, 0, agno, 0);
+
+		error = xfs_bwrite(bp);
+		xfs_buf_relse(bp);
+		if (error)
+			goto out_error;
+	}
+
+	/*
+	 * refcount btree root block
+	 */
+	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+		bp = xfs_growfs_get_hdr_buf(mp,
+			XFS_AGB_TO_DADDR(mp, agno, xfs_refc_block(mp)),
+			BTOBB(mp->m_sb.sb_blocksize), 0,
+			&xfs_refcountbt_buf_ops);
+		if (!bp) {
+			error = -ENOMEM;
+			goto out_error;
+		}
+
+		xfs_btree_init_block(mp, bp, XFS_BTNUM_REFC,
+				     0, 0, agno, 0);
+
+		error = xfs_bwrite(bp);
+		xfs_buf_relse(bp);
+		if (error)
+			goto out_error;
+	}
+
+out_error:
+	return error;
+}
+
 static int
 xfs_growfs_data_private(
 	xfs_mount_t		*mp,		/* mount point for filesystem */
 	xfs_growfs_data_t	*in)		/* growfs data input struct */
 {
 	xfs_agf_t		*agf;
-	struct xfs_agfl		*agfl;
 	xfs_agi_t		*agi;
 	xfs_agnumber_t		agno;
 	xfs_extlen_t		agsize;
-	xfs_extlen_t		tmpsize;
-	xfs_alloc_rec_t		*arec;
 	xfs_buf_t		*bp;
-	int			bucket;
 	int			dpct;
 	int			error, saved_error = 0;
 	xfs_agnumber_t		nagcount;
@@ -218,318 +542,19 @@ xfs_growfs_data_private(
 	 */
 	nfree = 0;
 	for (agno = nagcount - 1; agno >= oagcount; agno--, new -= agsize) {
-		__be32	*agfl_bno;
-
-		/*
-		 * AG freespace header block
-		 */
-		bp = xfs_growfs_get_hdr_buf(mp,
-				XFS_AG_DADDR(mp, agno, XFS_AGF_DADDR(mp)),
-				XFS_FSS_TO_BB(mp, 1), 0,
-				&xfs_agf_buf_ops);
-		if (!bp) {
-			error = -ENOMEM;
-			goto error0;
-		}
 
-		agf = XFS_BUF_TO_AGF(bp);
-		agf->agf_magicnum = cpu_to_be32(XFS_AGF_MAGIC);
-		agf->agf_versionnum = cpu_to_be32(XFS_AGF_VERSION);
-		agf->agf_seqno = cpu_to_be32(agno);
 		if (agno == nagcount - 1)
-			agsize =
-				nb -
+			agsize = nb -
 				(agno * (xfs_rfsblock_t)mp->m_sb.sb_agblocks);
 		else
 			agsize = mp->m_sb.sb_agblocks;
-		agf->agf_length = cpu_to_be32(agsize);
-		agf->agf_roots[XFS_BTNUM_BNOi] = cpu_to_be32(XFS_BNO_BLOCK(mp));
-		agf->agf_roots[XFS_BTNUM_CNTi] = cpu_to_be32(XFS_CNT_BLOCK(mp));
-		agf->agf_levels[XFS_BTNUM_BNOi] = cpu_to_be32(1);
-		agf->agf_levels[XFS_BTNUM_CNTi] = cpu_to_be32(1);
-		if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
-			agf->agf_roots[XFS_BTNUM_RMAPi] =
-						cpu_to_be32(XFS_RMAP_BLOCK(mp));
-			agf->agf_levels[XFS_BTNUM_RMAPi] = cpu_to_be32(1);
-			agf->agf_rmap_blocks = cpu_to_be32(1);
-		}
-
-		agf->agf_flfirst = cpu_to_be32(1);
-		agf->agf_fllast = 0;
-		agf->agf_flcount = 0;
-		tmpsize = agsize - mp->m_ag_prealloc_blocks;
-		agf->agf_freeblks = cpu_to_be32(tmpsize);
-		agf->agf_longest = cpu_to_be32(tmpsize);
-		if (xfs_sb_version_hascrc(&mp->m_sb))
-			uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
-		if (xfs_sb_version_hasreflink(&mp->m_sb)) {
-			agf->agf_refcount_root = cpu_to_be32(
-					xfs_refc_block(mp));
-			agf->agf_refcount_level = cpu_to_be32(1);
-			agf->agf_refcount_blocks = cpu_to_be32(1);
-		}
-
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-		if (error)
-			goto error0;
-
-		/*
-		 * AG freelist header block
-		 */
-		bp = xfs_growfs_get_hdr_buf(mp,
-				XFS_AG_DADDR(mp, agno, XFS_AGFL_DADDR(mp)),
-				XFS_FSS_TO_BB(mp, 1), 0,
-				&xfs_agfl_buf_ops);
-		if (!bp) {
-			error = -ENOMEM;
-			goto error0;
-		}
-
-		agfl = XFS_BUF_TO_AGFL(bp);
-		if (xfs_sb_version_hascrc(&mp->m_sb)) {
-			agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
-			agfl->agfl_seqno = cpu_to_be32(agno);
-			uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
-		}
-
-		agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, bp);
-		for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++)
-			agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);
 
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
+		error = xfs_grow_ag_headers(mp, agno, agsize, &nfree);
 		if (error)
 			goto error0;
-
-		/*
-		 * AG inode header block
-		 */
-		bp = xfs_growfs_get_hdr_buf(mp,
-				XFS_AG_DADDR(mp, agno, XFS_AGI_DADDR(mp)),
-				XFS_FSS_TO_BB(mp, 1), 0,
-				&xfs_agi_buf_ops);
-		if (!bp) {
-			error = -ENOMEM;
-			goto error0;
-		}
-
-		agi = XFS_BUF_TO_AGI(bp);
-		agi->agi_magicnum = cpu_to_be32(XFS_AGI_MAGIC);
-		agi->agi_versionnum = cpu_to_be32(XFS_AGI_VERSION);
-		agi->agi_seqno = cpu_to_be32(agno);
-		agi->agi_length = cpu_to_be32(agsize);
-		agi->agi_count = 0;
-		agi->agi_root = cpu_to_be32(XFS_IBT_BLOCK(mp));
-		agi->agi_level = cpu_to_be32(1);
-		agi->agi_freecount = 0;
-		agi->agi_newino = cpu_to_be32(NULLAGINO);
-		agi->agi_dirino = cpu_to_be32(NULLAGINO);
-		if (xfs_sb_version_hascrc(&mp->m_sb))
-			uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid);
-		if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
-			agi->agi_free_root = cpu_to_be32(XFS_FIBT_BLOCK(mp));
-			agi->agi_free_level = cpu_to_be32(1);
-		}
-		for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++)
-			agi->agi_unlinked[bucket] = cpu_to_be32(NULLAGINO);
-
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-		if (error)
-			goto error0;
-
-		/*
-		 * BNO btree root block
-		 */
-		bp = xfs_growfs_get_hdr_buf(mp,
-				XFS_AGB_TO_DADDR(mp, agno, XFS_BNO_BLOCK(mp)),
-				BTOBB(mp->m_sb.sb_blocksize), 0,
-				&xfs_allocbt_buf_ops);
-
-		if (!bp) {
-			error = -ENOMEM;
-			goto error0;
-		}
-
-		xfs_btree_init_block(mp, bp, XFS_BTNUM_BNO, 0, 1, agno, 0);
-
-		arec = XFS_ALLOC_REC_ADDR(mp, XFS_BUF_TO_BLOCK(bp), 1);
-		arec->ar_startblock = cpu_to_be32(mp->m_ag_prealloc_blocks);
-		arec->ar_blockcount = cpu_to_be32(
-			agsize - be32_to_cpu(arec->ar_startblock));
-
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-		if (error)
-			goto error0;
-
-		/*
-		 * CNT btree root block
-		 */
-		bp = xfs_growfs_get_hdr_buf(mp,
-				XFS_AGB_TO_DADDR(mp, agno, XFS_CNT_BLOCK(mp)),
-				BTOBB(mp->m_sb.sb_blocksize), 0,
-				&xfs_allocbt_buf_ops);
-		if (!bp) {
-			error = -ENOMEM;
-			goto error0;
-		}
-
-		xfs_btree_init_block(mp, bp, XFS_BTNUM_CNT, 0, 1, agno, 0);
-
-		arec = XFS_ALLOC_REC_ADDR(mp, XFS_BUF_TO_BLOCK(bp), 1);
-		arec->ar_startblock = cpu_to_be32(mp->m_ag_prealloc_blocks);
-		arec->ar_blockcount = cpu_to_be32(
-			agsize - be32_to_cpu(arec->ar_startblock));
-		nfree += be32_to_cpu(arec->ar_blockcount);
-
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-		if (error)
-			goto error0;
-
-		/* RMAP btree root block */
-		if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
-			struct xfs_rmap_rec	*rrec;
-			struct xfs_btree_block	*block;
-
-			bp = xfs_growfs_get_hdr_buf(mp,
-				XFS_AGB_TO_DADDR(mp, agno, XFS_RMAP_BLOCK(mp)),
-				BTOBB(mp->m_sb.sb_blocksize), 0,
-				&xfs_rmapbt_buf_ops);
-			if (!bp) {
-				error = -ENOMEM;
-				goto error0;
-			}
-
-			xfs_btree_init_block(mp, bp, XFS_BTNUM_RMAP, 0, 0,
-						agno, 0);
-			block = XFS_BUF_TO_BLOCK(bp);
-
-
-			/*
-			 * mark the AG header regions as static metadata The BNO
-			 * btree block is the first block after the headers, so
-			 * it's location defines the size of region the static
-			 * metadata consumes.
-			 *
-			 * Note: unlike mkfs, we never have to account for log
-			 * space when growing the data regions
-			 */
-			rrec = XFS_RMAP_REC_ADDR(block, 1);
-			rrec->rm_startblock = 0;
-			rrec->rm_blockcount = cpu_to_be32(XFS_BNO_BLOCK(mp));
-			rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_FS);
-			rrec->rm_offset = 0;
-			be16_add_cpu(&block->bb_numrecs, 1);
-
-			/* account freespace btree root blocks */
-			rrec = XFS_RMAP_REC_ADDR(block, 2);
-			rrec->rm_startblock = cpu_to_be32(XFS_BNO_BLOCK(mp));
-			rrec->rm_blockcount = cpu_to_be32(2);
-			rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_AG);
-			rrec->rm_offset = 0;
-			be16_add_cpu(&block->bb_numrecs, 1);
-
-			/* account inode btree root blocks */
-			rrec = XFS_RMAP_REC_ADDR(block, 3);
-			rrec->rm_startblock = cpu_to_be32(XFS_IBT_BLOCK(mp));
-			rrec->rm_blockcount = cpu_to_be32(XFS_RMAP_BLOCK(mp) -
-							XFS_IBT_BLOCK(mp));
-			rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_INOBT);
-			rrec->rm_offset = 0;
-			be16_add_cpu(&block->bb_numrecs, 1);
-
-			/* account for rmap btree root */
-			rrec = XFS_RMAP_REC_ADDR(block, 4);
-			rrec->rm_startblock = cpu_to_be32(XFS_RMAP_BLOCK(mp));
-			rrec->rm_blockcount = cpu_to_be32(1);
-			rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_AG);
-			rrec->rm_offset = 0;
-			be16_add_cpu(&block->bb_numrecs, 1);
-
-			/* account for refc btree root */
-			if (xfs_sb_version_hasreflink(&mp->m_sb)) {
-				rrec = XFS_RMAP_REC_ADDR(block, 5);
-				rrec->rm_startblock = cpu_to_be32(
-						xfs_refc_block(mp));
-				rrec->rm_blockcount = cpu_to_be32(1);
-				rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_REFC);
-				rrec->rm_offset = 0;
-				be16_add_cpu(&block->bb_numrecs, 1);
-			}
-
-			error = xfs_bwrite(bp);
-			xfs_buf_relse(bp);
-			if (error)
-				goto error0;
-		}
-
-		/*
-		 * INO btree root block
-		 */
-		bp = xfs_growfs_get_hdr_buf(mp,
-				XFS_AGB_TO_DADDR(mp, agno, XFS_IBT_BLOCK(mp)),
-				BTOBB(mp->m_sb.sb_blocksize), 0,
-				&xfs_inobt_buf_ops);
-		if (!bp) {
-			error = -ENOMEM;
-			goto error0;
-		}
-
-		xfs_btree_init_block(mp, bp, XFS_BTNUM_INO , 0, 0, agno, 0);
-
-		error = xfs_bwrite(bp);
-		xfs_buf_relse(bp);
-		if (error)
-			goto error0;
-
-		/*
-		 * FINO btree root block
-		 */
-		if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
-			bp = xfs_growfs_get_hdr_buf(mp,
-				XFS_AGB_TO_DADDR(mp, agno, XFS_FIBT_BLOCK(mp)),
-				BTOBB(mp->m_sb.sb_blocksize), 0,
-				&xfs_inobt_buf_ops);
-			if (!bp) {
-				error = -ENOMEM;
-				goto error0;
-			}
-
-			xfs_btree_init_block(mp, bp, XFS_BTNUM_FINO,
-						     0, 0, agno, 0);
-
-			error = xfs_bwrite(bp);
-			xfs_buf_relse(bp);
-			if (error)
-				goto error0;
-		}
-
-		/*
-		 * refcount btree root block
-		 */
-		if (xfs_sb_version_hasreflink(&mp->m_sb)) {
-			bp = xfs_growfs_get_hdr_buf(mp,
-				XFS_AGB_TO_DADDR(mp, agno, xfs_refc_block(mp)),
-				BTOBB(mp->m_sb.sb_blocksize), 0,
-				&xfs_refcountbt_buf_ops);
-			if (!bp) {
-				error = -ENOMEM;
-				goto error0;
-			}
-
-			xfs_btree_init_block(mp, bp, XFS_BTNUM_REFC,
-					     0, 0, agno, 0);
-
-			error = xfs_bwrite(bp);
-			xfs_buf_relse(bp);
-			if (error)
-				goto error0;
-		}
 	}
 	xfs_trans_agblocks_delta(tp, nfree);
+
 	/*
 	 * There are new blocks in the old last a.g.
 	 */
-- 
2.15.0.rc0



* [PATCH 02/14] xfs: convert growfs AG header init to use buffer lists
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
  2017-10-26  8:33 ` [PATCH 01/14] xfs: factor out AG header initialisation from growfs core Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26  8:33 ` [PATCH 03/14] xfs: factor ag btree root block initialisation Dave Chinner
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We currently write all new AG headers synchronously, which can be
slow for large grow operations. All we really need to do is ensure
all the headers are on disk before we run the growfs transaction, so
convert this to a buffer list and a delayed write operation. We
block waiting for the delayed write buffer submission to complete,
so this will fulfill the requirement to have all the buffers written
correctly before proceeding.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_fsops.c | 74 ++++++++++++++++++++++++------------------------------
 1 file changed, 33 insertions(+), 41 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 487a9dca1170..d21181d938dd 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -158,7 +158,8 @@ xfs_grow_ag_headers(
 	struct xfs_mount	*mp,
 	xfs_agnumber_t		agno,
 	xfs_extlen_t		agsize,
-	xfs_rfsblock_t		*nfree)
+	xfs_rfsblock_t		*nfree,
+	struct list_head	*buffer_list)
 {
 	struct xfs_agf		*agf;
 	struct xfs_agi		*agi;
@@ -212,11 +213,8 @@ xfs_grow_ag_headers(
 		agf->agf_refcount_level = cpu_to_be32(1);
 		agf->agf_refcount_blocks = cpu_to_be32(1);
 	}
-
-	error = xfs_bwrite(bp);
+	xfs_buf_delwri_queue(bp, buffer_list);
 	xfs_buf_relse(bp);
-	if (error)
-		goto out_error;
 
 	/*
 	 * AG freelist header block
@@ -241,10 +239,8 @@ xfs_grow_ag_headers(
 	for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++)
 		agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);
 
-	error = xfs_bwrite(bp);
+	xfs_buf_delwri_queue(bp, buffer_list);
 	xfs_buf_relse(bp);
-	if (error)
-		goto out_error;
 
 	/*
 	 * AG inode header block
@@ -278,10 +274,8 @@ xfs_grow_ag_headers(
 	for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++)
 		agi->agi_unlinked[bucket] = cpu_to_be32(NULLAGINO);
 
-	error = xfs_bwrite(bp);
+	xfs_buf_delwri_queue(bp, buffer_list);
 	xfs_buf_relse(bp);
-	if (error)
-		goto out_error;
 
 	/*
 	 * BNO btree root block
@@ -303,10 +297,8 @@ xfs_grow_ag_headers(
 	arec->ar_blockcount = cpu_to_be32(
 		agsize - be32_to_cpu(arec->ar_startblock));
 
-	error = xfs_bwrite(bp);
+	xfs_buf_delwri_queue(bp, buffer_list);
 	xfs_buf_relse(bp);
-	if (error)
-		goto out_error;
 
 	/*
 	 * CNT btree root block
@@ -328,10 +320,8 @@ xfs_grow_ag_headers(
 		agsize - be32_to_cpu(arec->ar_startblock));
 	*nfree += be32_to_cpu(arec->ar_blockcount);
 
-	error = xfs_bwrite(bp);
+	xfs_buf_delwri_queue(bp, buffer_list);
 	xfs_buf_relse(bp);
-	if (error)
-		goto out_error;
 
 	/* RMAP btree root block */
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
@@ -403,10 +393,8 @@ xfs_grow_ag_headers(
 			be16_add_cpu(&block->bb_numrecs, 1);
 		}
 
-		error = xfs_bwrite(bp);
+		xfs_buf_delwri_queue(bp, buffer_list);
 		xfs_buf_relse(bp);
-		if (error)
-			goto out_error;
 	}
 
 	/*
@@ -422,11 +410,8 @@ xfs_grow_ag_headers(
 	}
 
 	xfs_btree_init_block(mp, bp, XFS_BTNUM_INO , 0, 0, agno, 0);
-
-	error = xfs_bwrite(bp);
+	xfs_buf_delwri_queue(bp, buffer_list);
 	xfs_buf_relse(bp);
-	if (error)
-		goto out_error;
 
 	/*
 	 * FINO btree root block
@@ -441,13 +426,9 @@ xfs_grow_ag_headers(
 			goto out_error;
 		}
 
-		xfs_btree_init_block(mp, bp, XFS_BTNUM_FINO,
-					     0, 0, agno, 0);
-
-		error = xfs_bwrite(bp);
+		xfs_btree_init_block(mp, bp, XFS_BTNUM_FINO, 0, 0, agno, 0);
+		xfs_buf_delwri_queue(bp, buffer_list);
 		xfs_buf_relse(bp);
-		if (error)
-			goto out_error;
 	}
 
 	/*
@@ -463,13 +444,9 @@ xfs_grow_ag_headers(
 			goto out_error;
 		}
 
-		xfs_btree_init_block(mp, bp, XFS_BTNUM_REFC,
-				     0, 0, agno, 0);
-
-		error = xfs_bwrite(bp);
+		xfs_btree_init_block(mp, bp, XFS_BTNUM_REFC, 0, 0, agno, 0);
+		xfs_buf_delwri_queue(bp, buffer_list);
 		xfs_buf_relse(bp);
-		if (error)
-			goto out_error;
 	}
 
 out_error:
@@ -496,6 +473,7 @@ xfs_growfs_data_private(
 	xfs_agnumber_t		oagcount;
 	int			pct;
 	xfs_trans_t		*tp;
+	LIST_HEAD		(buffer_list);
 
 	nb = in->newblocks;
 	pct = in->imaxpct;
@@ -536,9 +514,16 @@ xfs_growfs_data_private(
 		return error;
 
 	/*
-	 * Write new AG headers to disk. Non-transactional, but written
-	 * synchronously so they are completed prior to the growfs transaction
-	 * being logged.
+	 * Write new AG headers to disk. Non-transactional, but need to be
+	 * written and completed prior to the growfs transaction being logged.
+	 * To do this, we use a delayed write buffer list and wait for
+	 * submission and IO completion of the list as a whole. This allows the
+	 * IO subsystem to merge all the AG headers in a single AG into a single
+	 * IO and hide most of the latency of the IO from us.
+	 *
+	 * This also means that if we get an error whilst building the buffer
+	 * list to write, we can cancel the entire list without having written
+	 * anything.
 	 */
 	nfree = 0;
 	for (agno = nagcount - 1; agno >= oagcount; agno--, new -= agsize) {
@@ -549,10 +534,17 @@ xfs_growfs_data_private(
 		else
 			agsize = mp->m_sb.sb_agblocks;
 
-		error = xfs_grow_ag_headers(mp, agno, agsize, &nfree);
-		if (error)
+		error = xfs_grow_ag_headers(mp, agno, agsize, &nfree,
+					    &buffer_list);
+		if (error) {
+			xfs_buf_delwri_cancel(&buffer_list);
 			goto error0;
+		}
 	}
+	error = xfs_buf_delwri_submit(&buffer_list);
+	if (error)
+		goto error0;
+
 	xfs_trans_agblocks_delta(tp, nfree);
 
 	/*
-- 
2.15.0.rc0



* [PATCH 03/14] xfs: factor ag btree root block initialisation
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
  2017-10-26  8:33 ` [PATCH 01/14] xfs: factor out AG header initialisation from growfs core Dave Chinner
  2017-10-26  8:33 ` [PATCH 02/14] xfs: convert growfs AG header init to use buffer lists Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26  8:33 ` [PATCH 04/14] xfs: turn ag header initialisation into a table driven operation Dave Chinner
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Cookie cutter code, easily factored.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_fsops.c | 491 +++++++++++++++++++++++++++++------------------------
 1 file changed, 270 insertions(+), 221 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index d21181d938dd..6744865da887 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -148,46 +148,146 @@ xfs_growfs_get_hdr_buf(
 	return bp;
 }
 
+struct aghdr_init_data {
+	/* per ag data */
+	xfs_agblock_t		agno;
+	xfs_extlen_t		agsize;
+	struct list_head	buffer_list;
+	xfs_rfsblock_t		nfree;
+
+	/* per header data */
+	xfs_daddr_t		daddr;
+	size_t			numblks;
+	xfs_btnum_t		type;
+	int			numrecs;
+};
+
 /*
- * Write new AG headers to disk. Non-transactional, but written
- * synchronously so they are completed prior to the growfs transaction
- * being logged.
+ * Generic btree root block init function
  */
-static int
-xfs_grow_ag_headers(
+static void
+xfs_btroot_init(
 	struct xfs_mount	*mp,
-	xfs_agnumber_t		agno,
-	xfs_extlen_t		agsize,
-	xfs_rfsblock_t		*nfree,
-	struct list_head	*buffer_list)
+	struct xfs_buf		*bp,
+	struct aghdr_init_data	*id)
+{
+	xfs_btree_init_block(mp, bp, id->type, 0, id->numrecs, id->agno, 0);
+}
+
+/*
+ * Alloc btree root block init functions
+ */
+static void
+xfs_bnoroot_init(
+	struct xfs_mount	*mp,
+	struct xfs_buf		*bp,
+	struct aghdr_init_data	*id)
 {
-	struct xfs_agf		*agf;
-	struct xfs_agi		*agi;
-	struct xfs_agfl		*agfl;
-	__be32			*agfl_bno;
 	xfs_alloc_rec_t		*arec;
-	struct xfs_buf		*bp;
-	int			bucket;
-	xfs_extlen_t		tmpsize;
-	int			error = 0;
+
+	xfs_btree_init_block(mp, bp, id->type, 0, id->numrecs, id->agno, 0);
+	arec = XFS_ALLOC_REC_ADDR(mp, XFS_BUF_TO_BLOCK(bp), 1);
+	arec->ar_startblock = cpu_to_be32(mp->m_ag_prealloc_blocks);
+	arec->ar_blockcount = cpu_to_be32(id->agsize -
+					  be32_to_cpu(arec->ar_startblock));
+}
+
+static void
+xfs_cntroot_init(
+	struct xfs_mount	*mp,
+	struct xfs_buf		*bp,
+	struct aghdr_init_data	*id)
+{
+	xfs_alloc_rec_t		*arec;
+
+	xfs_btree_init_block(mp, bp, id->type, 0, id->numrecs, id->agno, 0);
+	arec = XFS_ALLOC_REC_ADDR(mp, XFS_BUF_TO_BLOCK(bp), 1);
+	arec->ar_startblock = cpu_to_be32(mp->m_ag_prealloc_blocks);
+	arec->ar_blockcount = cpu_to_be32(id->agsize -
+					  be32_to_cpu(arec->ar_startblock));
+	id->nfree += be32_to_cpu(arec->ar_blockcount);
+}
+
+/*
+ * Reverse map root block init
+ */
+static void
+xfs_rmaproot_init(
+	struct xfs_mount	*mp,
+	struct xfs_buf		*bp,
+	struct aghdr_init_data	*id)
+{
+	struct xfs_btree_block	*block = XFS_BUF_TO_BLOCK(bp);
+	struct xfs_rmap_rec	*rrec;
+
+	xfs_btree_init_block(mp, bp, id->type, 0, id->numrecs, id->agno, 0);
 
 	/*
-	 * AG freespace header block
+	 * mark the AG header regions as static metadata The BNO
+	 * btree block is the first block after the headers, so
+	 * it's location defines the size of region the static
+	 * metadata consumes.
+	 *
+	 * Note: unlike mkfs, we never have to account for log
+	 * space when growing the data regions
 	 */
-	bp = xfs_growfs_get_hdr_buf(mp,
-			XFS_AG_DADDR(mp, agno, XFS_AGF_DADDR(mp)),
-			XFS_FSS_TO_BB(mp, 1), 0,
-			&xfs_agf_buf_ops);
-	if (!bp) {
-		error = -ENOMEM;
-		goto out_error;
+	rrec = XFS_RMAP_REC_ADDR(block, 1);
+	rrec->rm_startblock = 0;
+	rrec->rm_blockcount = cpu_to_be32(XFS_BNO_BLOCK(mp));
+	rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_FS);
+	rrec->rm_offset = 0;
+	be16_add_cpu(&block->bb_numrecs, 1);
+
+	/* account freespace btree root blocks */
+	rrec = XFS_RMAP_REC_ADDR(block, 2);
+	rrec->rm_startblock = cpu_to_be32(XFS_BNO_BLOCK(mp));
+	rrec->rm_blockcount = cpu_to_be32(2);
+	rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_AG);
+	rrec->rm_offset = 0;
+	be16_add_cpu(&block->bb_numrecs, 1);
+
+	/* account inode btree root blocks */
+	rrec = XFS_RMAP_REC_ADDR(block, 3);
+	rrec->rm_startblock = cpu_to_be32(XFS_IBT_BLOCK(mp));
+	rrec->rm_blockcount = cpu_to_be32(XFS_RMAP_BLOCK(mp) -
+					  XFS_IBT_BLOCK(mp));
+	rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_INOBT);
+	rrec->rm_offset = 0;
+	be16_add_cpu(&block->bb_numrecs, 1);
+
+	/* account for rmap btree root */
+	rrec = XFS_RMAP_REC_ADDR(block, 4);
+	rrec->rm_startblock = cpu_to_be32(XFS_RMAP_BLOCK(mp));
+	rrec->rm_blockcount = cpu_to_be32(1);
+	rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_AG);
+	rrec->rm_offset = 0;
+	be16_add_cpu(&block->bb_numrecs, 1);
+
+	/* account for refc btree root */
+	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+		rrec = XFS_RMAP_REC_ADDR(block, 5);
+		rrec->rm_startblock = cpu_to_be32(xfs_refc_block(mp));
+		rrec->rm_blockcount = cpu_to_be32(1);
+		rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_REFC);
+		rrec->rm_offset = 0;
+		be16_add_cpu(&block->bb_numrecs, 1);
 	}
+}
+
+
+static void
+xfs_agfblock_init(
+	struct xfs_mount	*mp,
+	struct xfs_buf		*bp,
+	struct aghdr_init_data	*id)
+{
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(bp);
+	xfs_extlen_t		tmpsize;
 
-	agf = XFS_BUF_TO_AGF(bp);
 	agf->agf_magicnum = cpu_to_be32(XFS_AGF_MAGIC);
 	agf->agf_versionnum = cpu_to_be32(XFS_AGF_VERSION);
-	agf->agf_seqno = cpu_to_be32(agno);
-	agf->agf_length = cpu_to_be32(agsize);
+	agf->agf_seqno = cpu_to_be32(id->agno);
+	agf->agf_length = cpu_to_be32(id->agsize);
 	agf->agf_roots[XFS_BTNUM_BNOi] = cpu_to_be32(XFS_BNO_BLOCK(mp));
 	agf->agf_roots[XFS_BTNUM_CNTi] = cpu_to_be32(XFS_CNT_BLOCK(mp));
 	agf->agf_levels[XFS_BTNUM_BNOi] = cpu_to_be32(1);
@@ -202,7 +302,7 @@ xfs_grow_ag_headers(
 	agf->agf_flfirst = cpu_to_be32(1);
 	agf->agf_fllast = 0;
 	agf->agf_flcount = 0;
-	tmpsize = agsize - mp->m_ag_prealloc_blocks;
+	tmpsize = id->agsize - mp->m_ag_prealloc_blocks;
 	agf->agf_freeblks = cpu_to_be32(tmpsize);
 	agf->agf_longest = cpu_to_be32(tmpsize);
 	if (xfs_sb_version_hascrc(&mp->m_sb))
@@ -213,52 +313,42 @@ xfs_grow_ag_headers(
 		agf->agf_refcount_level = cpu_to_be32(1);
 		agf->agf_refcount_blocks = cpu_to_be32(1);
 	}
-	xfs_buf_delwri_queue(bp, buffer_list);
-	xfs_buf_relse(bp);
+}
 
-	/*
-	 * AG freelist header block
-	 */
-	bp = xfs_growfs_get_hdr_buf(mp,
-			XFS_AG_DADDR(mp, agno, XFS_AGFL_DADDR(mp)),
-			XFS_FSS_TO_BB(mp, 1), 0,
-			&xfs_agfl_buf_ops);
-	if (!bp) {
-		error = -ENOMEM;
-		goto out_error;
-	}
+static void
+xfs_agflblock_init(
+	struct xfs_mount	*mp,
+	struct xfs_buf		*bp,
+	struct aghdr_init_data	*id)
+{
+	struct xfs_agfl		*agfl = XFS_BUF_TO_AGFL(bp);
+	__be32			*agfl_bno;
+	int			bucket;
 
-	agfl = XFS_BUF_TO_AGFL(bp);
 	if (xfs_sb_version_hascrc(&mp->m_sb)) {
 		agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
-		agfl->agfl_seqno = cpu_to_be32(agno);
+		agfl->agfl_seqno = cpu_to_be32(id->agno);
 		uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
 	}
 
 	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, bp);
 	for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++)
 		agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);
+}
 
-	xfs_buf_delwri_queue(bp, buffer_list);
-	xfs_buf_relse(bp);
-
-	/*
-	 * AG inode header block
-	 */
-	bp = xfs_growfs_get_hdr_buf(mp,
-			XFS_AG_DADDR(mp, agno, XFS_AGI_DADDR(mp)),
-			XFS_FSS_TO_BB(mp, 1), 0,
-			&xfs_agi_buf_ops);
-	if (!bp) {
-		error = -ENOMEM;
-		goto out_error;
-	}
+static void
+xfs_agiblock_init(
+	struct xfs_mount	*mp,
+	struct xfs_buf		*bp,
+	struct aghdr_init_data	*id)
+{
+	struct xfs_agi		*agi = XFS_BUF_TO_AGI(bp);
+	int			bucket;
 
-	agi = XFS_BUF_TO_AGI(bp);
 	agi->agi_magicnum = cpu_to_be32(XFS_AGI_MAGIC);
 	agi->agi_versionnum = cpu_to_be32(XFS_AGI_VERSION);
-	agi->agi_seqno = cpu_to_be32(agno);
-	agi->agi_length = cpu_to_be32(agsize);
+	agi->agi_seqno = cpu_to_be32(id->agno);
+	agi->agi_length = cpu_to_be32(id->agsize);
 	agi->agi_count = 0;
 	agi->agi_root = cpu_to_be32(XFS_IBT_BLOCK(mp));
 	agi->agi_level = cpu_to_be32(1);
@@ -273,180 +363,139 @@ xfs_grow_ag_headers(
 	}
 	for (bucket = 0; bucket < XFS_AGI_UNLINKED_BUCKETS; bucket++)
 		agi->agi_unlinked[bucket] = cpu_to_be32(NULLAGINO);
+}
 
-	xfs_buf_delwri_queue(bp, buffer_list);
-	xfs_buf_relse(bp);
-
-	/*
-	 * BNO btree root block
-	 */
-	bp = xfs_growfs_get_hdr_buf(mp,
-			XFS_AGB_TO_DADDR(mp, agno, XFS_BNO_BLOCK(mp)),
-			BTOBB(mp->m_sb.sb_blocksize), 0,
-			&xfs_allocbt_buf_ops);
+static int
+xfs_growfs_init_aghdr(
+	struct xfs_mount	*mp,
+	struct aghdr_init_data	*id,
+	void			(*work)(struct xfs_mount *, struct xfs_buf *,
+					struct aghdr_init_data *),
+	const struct xfs_buf_ops *ops)
 
-	if (!bp) {
-		error = -ENOMEM;
-		goto out_error;
-	}
+{
+	struct xfs_buf		*bp;
 
-	xfs_btree_init_block(mp, bp, XFS_BTNUM_BNO, 0, 1, agno, 0);
+	bp = xfs_growfs_get_hdr_buf(mp, id->daddr, id->numblks, 0, ops);
+	if (!bp)
+		return -ENOMEM;
 
-	arec = XFS_ALLOC_REC_ADDR(mp, XFS_BUF_TO_BLOCK(bp), 1);
-	arec->ar_startblock = cpu_to_be32(mp->m_ag_prealloc_blocks);
-	arec->ar_blockcount = cpu_to_be32(
-		agsize - be32_to_cpu(arec->ar_startblock));
+	(*work)(mp, bp, id);
 
-	xfs_buf_delwri_queue(bp, buffer_list);
+	xfs_buf_delwri_queue(bp, &id->buffer_list);
 	xfs_buf_relse(bp);
+	return 0;
+}
 
-	/*
-	 * CNT btree root block
-	 */
-	bp = xfs_growfs_get_hdr_buf(mp,
-			XFS_AGB_TO_DADDR(mp, agno, XFS_CNT_BLOCK(mp)),
-			BTOBB(mp->m_sb.sb_blocksize), 0,
-			&xfs_allocbt_buf_ops);
-	if (!bp) {
-		error = -ENOMEM;
-		goto out_error;
-	}
-
-	xfs_btree_init_block(mp, bp, XFS_BTNUM_CNT, 0, 1, agno, 0);
+/*
+ * Write new AG headers to disk. Non-transactional, but written
+ * synchronously so they are completed prior to the growfs transaction
+ * being logged.
+ */
+static int
+xfs_grow_ag_headers(
+	struct xfs_mount	*mp,
+	struct aghdr_init_data	*id)
 
-	arec = XFS_ALLOC_REC_ADDR(mp, XFS_BUF_TO_BLOCK(bp), 1);
-	arec->ar_startblock = cpu_to_be32(mp->m_ag_prealloc_blocks);
-	arec->ar_blockcount = cpu_to_be32(
-		agsize - be32_to_cpu(arec->ar_startblock));
-	*nfree += be32_to_cpu(arec->ar_blockcount);
+{
+	int			error = 0;
 
-	xfs_buf_delwri_queue(bp, buffer_list);
-	xfs_buf_relse(bp);
+	/* AG freespace header block */
+	id->daddr = XFS_AG_DADDR(mp, id->agno, XFS_AGF_DADDR(mp));
+	id->numblks = XFS_FSS_TO_BB(mp, 1);
+	error = xfs_growfs_init_aghdr(mp, id, xfs_agfblock_init,
+					&xfs_agf_buf_ops);
+	if (error)
+		goto out_error;
 
-	/* RMAP btree root block */
-	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
-		struct xfs_rmap_rec	*rrec;
-		struct xfs_btree_block	*block;
-
-		bp = xfs_growfs_get_hdr_buf(mp,
-			XFS_AGB_TO_DADDR(mp, agno, XFS_RMAP_BLOCK(mp)),
-			BTOBB(mp->m_sb.sb_blocksize), 0,
-			&xfs_rmapbt_buf_ops);
-		if (!bp) {
-			error = -ENOMEM;
-			goto out_error;
-		}
+	/* AG freelist header block */
+	id->daddr = XFS_AG_DADDR(mp, id->agno, XFS_AGFL_DADDR(mp));
+	id->numblks = XFS_FSS_TO_BB(mp, 1);
+	error = xfs_growfs_init_aghdr(mp, id, xfs_agflblock_init,
+					&xfs_agfl_buf_ops);
+	if (error)
+		goto out_error;
 
-		xfs_btree_init_block(mp, bp, XFS_BTNUM_RMAP, 0, 0,
-					agno, 0);
-		block = XFS_BUF_TO_BLOCK(bp);
+	/* AG inode header block */
+	id->daddr = XFS_AG_DADDR(mp, id->agno, XFS_AGI_DADDR(mp));
+	id->numblks = XFS_FSS_TO_BB(mp, 1);
+	error = xfs_growfs_init_aghdr(mp, id, xfs_agiblock_init,
+					&xfs_agi_buf_ops);
+	if (error)
+		goto out_error;
 
 
-		/*
-		 * mark the AG header regions as static metadata The BNO
-		 * btree block is the first block after the headers, so
-		 * it's location defines the size of region the static
-		 * metadata consumes.
-		 *
-		 * Note: unlike mkfs, we never have to account for log
-		 * space when growing the data regions
-		 */
-		rrec = XFS_RMAP_REC_ADDR(block, 1);
-		rrec->rm_startblock = 0;
-		rrec->rm_blockcount = cpu_to_be32(XFS_BNO_BLOCK(mp));
-		rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_FS);
-		rrec->rm_offset = 0;
-		be16_add_cpu(&block->bb_numrecs, 1);
-
-		/* account freespace btree root blocks */
-		rrec = XFS_RMAP_REC_ADDR(block, 2);
-		rrec->rm_startblock = cpu_to_be32(XFS_BNO_BLOCK(mp));
-		rrec->rm_blockcount = cpu_to_be32(2);
-		rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_AG);
-		rrec->rm_offset = 0;
-		be16_add_cpu(&block->bb_numrecs, 1);
+	/* BNO btree root block */
+	id->daddr = XFS_AGB_TO_DADDR(mp, id->agno, XFS_BNO_BLOCK(mp));
+	id->numblks = BTOBB(mp->m_sb.sb_blocksize);
+	id->type = XFS_BTNUM_BNO;
+	id->numrecs = 1;
+	error = xfs_growfs_init_aghdr(mp, id, xfs_bnoroot_init,
+				   &xfs_allocbt_buf_ops);
+	if (error)
+		goto out_error;
 
-		/* account inode btree root blocks */
-		rrec = XFS_RMAP_REC_ADDR(block, 3);
-		rrec->rm_startblock = cpu_to_be32(XFS_IBT_BLOCK(mp));
-		rrec->rm_blockcount = cpu_to_be32(XFS_RMAP_BLOCK(mp) -
-						XFS_IBT_BLOCK(mp));
-		rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_INOBT);
-		rrec->rm_offset = 0;
-		be16_add_cpu(&block->bb_numrecs, 1);
 
-		/* account for rmap btree root */
-		rrec = XFS_RMAP_REC_ADDR(block, 4);
-		rrec->rm_startblock = cpu_to_be32(XFS_RMAP_BLOCK(mp));
-		rrec->rm_blockcount = cpu_to_be32(1);
-		rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_AG);
-		rrec->rm_offset = 0;
-		be16_add_cpu(&block->bb_numrecs, 1);
+	/* CNT btree root block */
+	id->daddr = XFS_AGB_TO_DADDR(mp, id->agno, XFS_CNT_BLOCK(mp));
+	id->numblks = BTOBB(mp->m_sb.sb_blocksize);
+	id->type = XFS_BTNUM_CNT;
+	id->numrecs = 1;
+	error = xfs_growfs_init_aghdr(mp, id, xfs_cntroot_init,
+				   &xfs_allocbt_buf_ops);
+	if (error)
+		goto out_error;
 
-		/* account for refc btree root */
-		if (xfs_sb_version_hasreflink(&mp->m_sb)) {
-			rrec = XFS_RMAP_REC_ADDR(block, 5);
-			rrec->rm_startblock = cpu_to_be32(xfs_refc_block(mp));
-			rrec->rm_blockcount = cpu_to_be32(1);
-			rrec->rm_owner = cpu_to_be64(XFS_RMAP_OWN_REFC);
-			rrec->rm_offset = 0;
-			be16_add_cpu(&block->bb_numrecs, 1);
-		}
+	/* RMAP btree root block */
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		id->daddr = XFS_AGB_TO_DADDR(mp, id->agno, XFS_RMAP_BLOCK(mp));
+		id->numblks = BTOBB(mp->m_sb.sb_blocksize);
+		id->type = XFS_BTNUM_RMAP;
+		id->numrecs = 0;
+		error = xfs_growfs_init_aghdr(mp, id, xfs_rmaproot_init,
+					   &xfs_rmapbt_buf_ops);
+		if (error)
+			goto out_error;
 
-		xfs_buf_delwri_queue(bp, buffer_list);
-		xfs_buf_relse(bp);
 	}
 
-	/*
-	 * INO btree root block
-	 */
-	bp = xfs_growfs_get_hdr_buf(mp,
-			XFS_AGB_TO_DADDR(mp, agno, XFS_IBT_BLOCK(mp)),
-			BTOBB(mp->m_sb.sb_blocksize), 0,
-			&xfs_inobt_buf_ops);
-	if (!bp) {
-		error = -ENOMEM;
+	/* INO btree root block */
+	id->daddr = XFS_AGB_TO_DADDR(mp, id->agno, XFS_IBT_BLOCK(mp));
+	id->numblks = BTOBB(mp->m_sb.sb_blocksize);
+	id->type = XFS_BTNUM_INO;
+	id->numrecs = 0;
+	error = xfs_growfs_init_aghdr(mp, id, xfs_btroot_init,
+				   &xfs_inobt_buf_ops);
+	if (error)
 		goto out_error;
-	}
 
-	xfs_btree_init_block(mp, bp, XFS_BTNUM_INO , 0, 0, agno, 0);
-	xfs_buf_delwri_queue(bp, buffer_list);
-	xfs_buf_relse(bp);
 
 	/*
 	 * FINO btree root block
 	 */
 	if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
-		bp = xfs_growfs_get_hdr_buf(mp,
-			XFS_AGB_TO_DADDR(mp, agno, XFS_FIBT_BLOCK(mp)),
-			BTOBB(mp->m_sb.sb_blocksize), 0,
-			&xfs_inobt_buf_ops);
-		if (!bp) {
-			error = -ENOMEM;
+		id->daddr = XFS_AGB_TO_DADDR(mp, id->agno, XFS_FIBT_BLOCK(mp));
+		id->numblks = BTOBB(mp->m_sb.sb_blocksize);
+		id->type = XFS_BTNUM_FINO;
+		id->numrecs = 0;
+		error = xfs_growfs_init_aghdr(mp, id, xfs_btroot_init,
+					   &xfs_inobt_buf_ops);
+		if (error)
 			goto out_error;
-		}
-
-		xfs_btree_init_block(mp, bp, XFS_BTNUM_FINO, 0, 0, agno, 0);
-		xfs_buf_delwri_queue(bp, buffer_list);
-		xfs_buf_relse(bp);
 	}
 
 	/*
 	 * refcount btree root block
 	 */
 	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
-		bp = xfs_growfs_get_hdr_buf(mp,
-			XFS_AGB_TO_DADDR(mp, agno, xfs_refc_block(mp)),
-			BTOBB(mp->m_sb.sb_blocksize), 0,
-			&xfs_refcountbt_buf_ops);
-		if (!bp) {
-			error = -ENOMEM;
+		id->daddr = XFS_AGB_TO_DADDR(mp, id->agno, xfs_refc_block(mp));
+		id->numblks = BTOBB(mp->m_sb.sb_blocksize);
+		id->type = XFS_BTNUM_REFC;
+		id->numrecs = 0;
+		error = xfs_growfs_init_aghdr(mp, id, xfs_btroot_init,
+					   &xfs_refcountbt_buf_ops);
+		if (error)
 			goto out_error;
-		}
-
-		xfs_btree_init_block(mp, bp, XFS_BTNUM_REFC, 0, 0, agno, 0);
-		xfs_buf_delwri_queue(bp, buffer_list);
-		xfs_buf_relse(bp);
 	}
 
 out_error:
@@ -461,7 +510,6 @@ xfs_growfs_data_private(
 	xfs_agf_t		*agf;
 	xfs_agi_t		*agi;
 	xfs_agnumber_t		agno;
-	xfs_extlen_t		agsize;
 	xfs_buf_t		*bp;
 	int			dpct;
 	int			error, saved_error = 0;
@@ -469,11 +517,11 @@ xfs_growfs_data_private(
 	xfs_agnumber_t		nagimax = 0;
 	xfs_rfsblock_t		nb, nb_mod;
 	xfs_rfsblock_t		new;
-	xfs_rfsblock_t		nfree;
 	xfs_agnumber_t		oagcount;
 	int			pct;
 	xfs_trans_t		*tp;
 	LIST_HEAD		(buffer_list);
+	struct aghdr_init_data	id = {};
 
 	nb = in->newblocks;
 	pct = in->imaxpct;
@@ -525,27 +573,28 @@ xfs_growfs_data_private(
 	 * list to write, we can cancel the entire list without having written
 	 * anything.
 	 */
-	nfree = 0;
-	for (agno = nagcount - 1; agno >= oagcount; agno--, new -= agsize) {
-
-		if (agno == nagcount - 1)
-			agsize = nb -
-				(agno * (xfs_rfsblock_t)mp->m_sb.sb_agblocks);
+	INIT_LIST_HEAD(&id.buffer_list);
+	for (id.agno = nagcount - 1;
+	     id.agno >= oagcount;
+	     id.agno--, new -= id.agsize) {
+
+		if (id.agno == nagcount - 1)
+			id.agsize = nb -
+				(id.agno * (xfs_rfsblock_t)mp->m_sb.sb_agblocks);
 		else
-			agsize = mp->m_sb.sb_agblocks;
+			id.agsize = mp->m_sb.sb_agblocks;
 
-		error = xfs_grow_ag_headers(mp, agno, agsize, &nfree,
-					    &buffer_list);
+		error = xfs_grow_ag_headers(mp, &id);
 		if (error) {
-			xfs_buf_delwri_cancel(&buffer_list);
+			xfs_buf_delwri_cancel(&id.buffer_list);
 			goto error0;
 		}
 	}
-	error = xfs_buf_delwri_submit(&buffer_list);
+	error = xfs_buf_delwri_submit(&id.buffer_list);
 	if (error)
 		goto error0;
 
-	xfs_trans_agblocks_delta(tp, nfree);
+	xfs_trans_agblocks_delta(tp, id.nfree);
 
 	/*
 	 * There are new blocks in the old last a.g.
@@ -556,7 +605,7 @@ xfs_growfs_data_private(
 		/*
 		 * Change the agi length.
 		 */
-		error = xfs_ialloc_read_agi(mp, tp, agno, &bp);
+		error = xfs_ialloc_read_agi(mp, tp, id.agno, &bp);
 		if (error) {
 			goto error0;
 		}
@@ -569,7 +618,7 @@ xfs_growfs_data_private(
 		/*
 		 * Change agf length.
 		 */
-		error = xfs_alloc_read_agf(mp, tp, agno, 0, &bp);
+		error = xfs_alloc_read_agf(mp, tp, id.agno, 0, &bp);
 		if (error) {
 			goto error0;
 		}
@@ -589,7 +638,7 @@ xfs_growfs_data_private(
 		 */
 		xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_NULL);
 		error = xfs_free_extent(tp,
-				XFS_AGB_TO_FSB(mp, agno,
+				XFS_AGB_TO_FSB(mp, id.agno,
 					be32_to_cpu(agf->agf_length) - new),
 				new, &oinfo, XFS_AG_RESV_NONE);
 		if (error)
@@ -606,8 +655,8 @@ xfs_growfs_data_private(
 	if (nb > mp->m_sb.sb_dblocks)
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS,
 				 nb - mp->m_sb.sb_dblocks);
-	if (nfree)
-		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, nfree);
+	if (id.nfree)
+		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
 	if (dpct)
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
 	xfs_trans_set_sync(tp);
@@ -634,7 +683,7 @@ xfs_growfs_data_private(
 	if (new) {
 		struct xfs_perag	*pag;
 
-		pag = xfs_perag_get(mp, agno);
+		pag = xfs_perag_get(mp, id.agno);
 		error = xfs_ag_resv_free(pag);
 		xfs_perag_put(pag);
 		if (error)
-- 
2.15.0.rc0



* [PATCH 04/14] xfs: turn ag header initialisation into a table driven operation
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (2 preceding siblings ...)
  2017-10-26  8:33 ` [PATCH 03/14] xfs: factor ag btree root block initialisation Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26  8:33 ` [PATCH 05/14] xfs: make imaxpct changes in growfs separate Dave Chinner
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

There's still more cookie cutter code in setting up each AG header.
Separate all the variables into a simple structure and iterate a table
of header definitions to initialise everything.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_fsops.c | 163 ++++++++++++++++++++++-------------------------------
 1 file changed, 66 insertions(+), 97 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 6744865da887..182f9eafab52 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -365,12 +365,13 @@ xfs_agiblock_init(
 		agi->agi_unlinked[bucket] = cpu_to_be32(NULLAGINO);
 }
 
+typedef void (*aghdr_init_work_f)(struct xfs_mount *mp, struct xfs_buf *bp,
+				  struct aghdr_init_data *id);
 static int
 xfs_growfs_init_aghdr(
 	struct xfs_mount	*mp,
 	struct aghdr_init_data	*id,
-	void			(*work)(struct xfs_mount *, struct xfs_buf *,
-					struct aghdr_init_data *),
+	aghdr_init_work_f	work,
 	const struct xfs_buf_ops *ops)
 
 {
@@ -387,6 +388,16 @@ xfs_growfs_init_aghdr(
 	return 0;
 }
 
+struct xfs_aghdr_grow_data {
+	xfs_daddr_t		daddr;
+	size_t			numblks;
+	const struct xfs_buf_ops *ops;
+	aghdr_init_work_f	work;
+	xfs_btnum_t		type;
+	int			numrecs;
+	bool			need_init;
+};
+
 /*
  * Write new AG headers to disk. Non-transactional, but written
  * synchronously so they are completed prior to the growfs transaction
@@ -398,107 +409,65 @@ xfs_grow_ag_headers(
 	struct aghdr_init_data	*id)
 
 {
+	struct xfs_aghdr_grow_data aghdr_data[] = {
+		/* AGF */
+		{ XFS_AG_DADDR(mp, id->agno, XFS_AGF_DADDR(mp)),
+		  XFS_FSS_TO_BB(mp, 1), &xfs_agf_buf_ops,
+		  &xfs_agfblock_init, 0, 0, true },
+		/* AGFL */
+		{ XFS_AG_DADDR(mp, id->agno, XFS_AGFL_DADDR(mp)),
+		  XFS_FSS_TO_BB(mp, 1), &xfs_agfl_buf_ops,
+		  &xfs_agflblock_init, 0, 0, true },
+		/* AGI */
+		{ XFS_AG_DADDR(mp, id->agno, XFS_AGI_DADDR(mp)),
+		  XFS_FSS_TO_BB(mp, 1), &xfs_agi_buf_ops,
+		  &xfs_agiblock_init, 0, 0, true },
+		/* BNO root block */
+		{ XFS_AGB_TO_DADDR(mp, id->agno, XFS_BNO_BLOCK(mp)),
+		  BTOBB(mp->m_sb.sb_blocksize), &xfs_allocbt_buf_ops,
+		  &xfs_bnoroot_init, XFS_BTNUM_BNO, 1, true },
+		/* CNT root block */
+		{ XFS_AGB_TO_DADDR(mp, id->agno, XFS_CNT_BLOCK(mp)),
+		  BTOBB(mp->m_sb.sb_blocksize), &xfs_allocbt_buf_ops,
+		  &xfs_cntroot_init, XFS_BTNUM_CNT, 1, true },
+		/* INO root block */
+		{ XFS_AGB_TO_DADDR(mp, id->agno, XFS_IBT_BLOCK(mp)),
+		  BTOBB(mp->m_sb.sb_blocksize), &xfs_inobt_buf_ops,
+		  &xfs_btroot_init, XFS_BTNUM_INO, 0, true },
+		/* FINO root block */
+		{ XFS_AGB_TO_DADDR(mp, id->agno, XFS_FIBT_BLOCK(mp)),
+		  BTOBB(mp->m_sb.sb_blocksize), &xfs_inobt_buf_ops,
+		  &xfs_btroot_init, XFS_BTNUM_FINO, 0,
+		  xfs_sb_version_hasfinobt(&mp->m_sb) },
+		/* RMAP root block */
+		{ XFS_AGB_TO_DADDR(mp, id->agno, XFS_RMAP_BLOCK(mp)),
+		  BTOBB(mp->m_sb.sb_blocksize), &xfs_rmapbt_buf_ops,
+		  &xfs_rmaproot_init, XFS_BTNUM_RMAP, 0,
+		  xfs_sb_version_hasrmapbt(&mp->m_sb) },
+		/* REFC root block */
+		{ XFS_AGB_TO_DADDR(mp, id->agno, xfs_refc_block(mp)),
+		  BTOBB(mp->m_sb.sb_blocksize), &xfs_refcountbt_buf_ops,
+		  &xfs_btroot_init, XFS_BTNUM_REFC, 0,
+		  xfs_sb_version_hasreflink(&mp->m_sb) },
+		/* NULL terminating block */
+		{ XFS_BUF_DADDR_NULL, 0, NULL, NULL, 0, 0, false },
+	};
+	struct  xfs_aghdr_grow_data *dp;
 	int			error = 0;
 
-	/* AG freespace header block */
-	id->daddr = XFS_AG_DADDR(mp, id->agno, XFS_AGF_DADDR(mp));
-	id->numblks = XFS_FSS_TO_BB(mp, 1);
-	error = xfs_growfs_init_aghdr(mp, id, xfs_agfblock_init,
-					&xfs_agf_buf_ops);
-	if (error)
-		goto out_error;
-
-	/* AG freelist header block */
-	id->daddr = XFS_AG_DADDR(mp, id->agno, XFS_AGFL_DADDR(mp));
-	id->numblks = XFS_FSS_TO_BB(mp, 1);
-	error = xfs_growfs_init_aghdr(mp, id, xfs_agflblock_init,
-					&xfs_agfl_buf_ops);
-	if (error)
-		goto out_error;
-
-	/* AG inode header block */
-	id->daddr = XFS_AG_DADDR(mp, id->agno, XFS_AGI_DADDR(mp));
-	id->numblks = XFS_FSS_TO_BB(mp, 1);
-	error = xfs_growfs_init_aghdr(mp, id, xfs_agiblock_init,
-					&xfs_agi_buf_ops);
-	if (error)
-		goto out_error;
-
-
-	/* BNO btree root block */
-	id->daddr = XFS_AGB_TO_DADDR(mp, id->agno, XFS_BNO_BLOCK(mp));
-	id->numblks = BTOBB(mp->m_sb.sb_blocksize);
-	id->type = XFS_BTNUM_BNO;
-	id->numrecs = 1;
-	error = xfs_growfs_init_aghdr(mp, id, xfs_bnoroot_init,
-				   &xfs_allocbt_buf_ops);
-	if (error)
-		goto out_error;
-
-
-	/* CNT btree root block */
-	id->daddr = XFS_AGB_TO_DADDR(mp, id->agno, XFS_CNT_BLOCK(mp));
-	id->numblks = BTOBB(mp->m_sb.sb_blocksize);
-	id->type = XFS_BTNUM_CNT;
-	id->numrecs = 1;
-	error = xfs_growfs_init_aghdr(mp, id, xfs_cntroot_init,
-				   &xfs_allocbt_buf_ops);
-	if (error)
-		goto out_error;
-
-	/* RMAP btree root block */
-	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
-		id->daddr = XFS_AGB_TO_DADDR(mp, id->agno, XFS_RMAP_BLOCK(mp));
-		id->numblks = BTOBB(mp->m_sb.sb_blocksize);
-		id->type = XFS_BTNUM_RMAP;
-		id->numrecs = 0;
-		error = xfs_growfs_init_aghdr(mp, id, xfs_rmaproot_init,
-					   &xfs_rmapbt_buf_ops);
-		if (error)
-			goto out_error;
-
-	}
-
-	/* INO btree root block */
-	id->daddr = XFS_AGB_TO_DADDR(mp, id->agno, XFS_IBT_BLOCK(mp));
-	id->numblks = BTOBB(mp->m_sb.sb_blocksize);
-	id->type = XFS_BTNUM_INO;
-	id->numrecs = 0;
-	error = xfs_growfs_init_aghdr(mp, id, xfs_btroot_init,
-				   &xfs_inobt_buf_ops);
-	if (error)
-		goto out_error;
-
-
-	/*
-	 * FINO btree root block
-	 */
-	if (xfs_sb_version_hasfinobt(&mp->m_sb)) {
-		id->daddr = XFS_AGB_TO_DADDR(mp, id->agno, XFS_FIBT_BLOCK(mp));
-		id->numblks = BTOBB(mp->m_sb.sb_blocksize);
-		id->type = XFS_BTNUM_FINO;
-		id->numrecs = 0;
-		error = xfs_growfs_init_aghdr(mp, id, xfs_btroot_init,
-					   &xfs_inobt_buf_ops);
-		if (error)
-			goto out_error;
-	}
+	for (dp = &aghdr_data[0]; dp->daddr != XFS_BUF_DADDR_NULL; dp++) {
+		if (!dp->need_init)
+			continue;
 
-	/*
-	 * refcount btree root block
-	 */
-	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
-		id->daddr = XFS_AGB_TO_DADDR(mp, id->agno, xfs_refc_block(mp));
-		id->numblks = BTOBB(mp->m_sb.sb_blocksize);
-		id->type = XFS_BTNUM_REFC;
-		id->numrecs = 0;
-		error = xfs_growfs_init_aghdr(mp, id, xfs_btroot_init,
-					   &xfs_refcountbt_buf_ops);
+		id->daddr = dp->daddr;
+		id->numblks = dp->numblks;
+		id->numrecs = dp->numrecs;
+		id->type = dp->type;
+		error = xfs_growfs_init_aghdr(mp, id, dp->work, dp->ops);
 		if (error)
-			goto out_error;
+			break;
 	}
 
-out_error:
 	return error;
 }
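
For reference, each aghdr_data[] entry above corresponds to a descriptor
along these lines - a sketch only, with the field names taken from the
loop body and the types and ordering inferred from the initialisers, not
the real definition from earlier in this patch:

struct xfs_aghdr_grow_data {
	xfs_daddr_t	daddr;		/* location of the header block */
	size_t		numblks;	/* buffer length in basic blocks */
	const struct xfs_buf_ops *ops;	/* buffer verifier ops */
	void		(*work)(struct xfs_mount *mp, struct xfs_buf *bp,
				struct aghdr_init_data *id);
	xfs_btnum_t	type;		/* btree type for root blocks */
	int		numrecs;	/* records stamped into a root block */
	bool		need_init;	/* false for disabled features */
};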
 
-- 
2.15.0.rc0



* [PATCH 05/14] xfs: make imaxpct changes in growfs separate
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (3 preceding siblings ...)
  2017-10-26  8:33 ` [PATCH 04/14] xfs: turn ag header initialisation into a table driven operation Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26  8:33 ` [PATCH 06/14] xfs: separate secondary sb update in growfs Dave Chinner
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When growfs changes the imaxpct value of the filesystem, it runs
through all the "change size" growfs code, whether it needs to or
not. Separate out changing imaxpct into its own function and
transaction to simplify the rest of the growfs code.
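
From userspace an imaxpct-only change still goes through
XFS_IOC_FSGROWFSDATA with the current size passed back unchanged;
something like the sketch below (illustrative only, minimal error
handling, assumes xfsprogs headers and CAP_SYS_ADMIN):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>

/* Raise imaxpct to 30% without growing the data device. */
int set_imaxpct(const char *mntpt)
{
	struct xfs_fsop_geom	geo;
	struct xfs_growfs_data	in;
	int			ret, fd;

	fd = open(mntpt, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, XFS_IOC_FSGEOMETRY, &geo);
	if (!ret) {
		in.newblocks = geo.datablocks;	/* size is unchanged */
		in.imaxpct = 30;		/* only the inode limit moves */
		ret = ioctl(fd, XFS_IOC_FSGROWFSDATA, &in);
	}
	close(fd);
	return ret;
}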

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_fsops.c | 67 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 182f9eafab52..9f8877388645 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -480,25 +480,21 @@ xfs_growfs_data_private(
 	xfs_agi_t		*agi;
 	xfs_agnumber_t		agno;
 	xfs_buf_t		*bp;
-	int			dpct;
 	int			error, saved_error = 0;
 	xfs_agnumber_t		nagcount;
 	xfs_agnumber_t		nagimax = 0;
 	xfs_rfsblock_t		nb, nb_mod;
 	xfs_rfsblock_t		new;
 	xfs_agnumber_t		oagcount;
-	int			pct;
 	xfs_trans_t		*tp;
 	LIST_HEAD		(buffer_list);
 	struct aghdr_init_data	id = {};
 
 	nb = in->newblocks;
-	pct = in->imaxpct;
-	if (nb < mp->m_sb.sb_dblocks || pct < 0 || pct > 100)
+	if (nb < mp->m_sb.sb_dblocks)
 		return -EINVAL;
 	if ((error = xfs_sb_validate_fsb_count(&mp->m_sb, nb)))
 		return error;
-	dpct = pct - mp->m_sb.sb_imax_pct;
 	error = xfs_buf_read_uncached(mp->m_ddev_targp,
 				XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1),
 				XFS_FSS_TO_BB(mp, 1), 0, &bp, NULL);
@@ -626,8 +622,6 @@ xfs_growfs_data_private(
 				 nb - mp->m_sb.sb_dblocks);
 	if (id.nfree)
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
-	if (dpct)
-		xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
 	xfs_trans_set_sync(tp);
 	error = xfs_trans_commit(tp);
 	if (error)
@@ -636,12 +630,6 @@ xfs_growfs_data_private(
 	/* New allocation groups fully initialized, so update mount struct */
 	if (nagimax)
 		mp->m_maxagi = nagimax;
-	if (mp->m_sb.sb_imax_pct) {
-		uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;
-		do_div(icount, 100);
-		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
-	} else
-		mp->m_maxicount = 0;
 	xfs_set_low_space_thresholds(mp);
 	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
 
@@ -745,25 +733,68 @@ xfs_growfs_log_private(
 	return -ENOSYS;
 }
 
+static int
+xfs_growfs_imaxpct(
+	struct xfs_mount	*mp,
+	__u32			imaxpct)
+{
+	struct xfs_trans	*tp;
+	int64_t			dpct;
+	int			error;
+
+	if (imaxpct > 100)
+		return -EINVAL;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata,
+			XFS_GROWFS_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	dpct = (int64_t)imaxpct - mp->m_sb.sb_imax_pct;
+	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
+	xfs_trans_set_sync(tp);
+	return xfs_trans_commit(tp);
+}
+
 /*
  * protected versions of growfs function acquire and release locks on the mount
  * point - exported through ioctls: XFS_IOC_FSGROWFSDATA, XFS_IOC_FSGROWFSLOG,
  * XFS_IOC_FSGROWFSRT
  */
-
-
 int
 xfs_growfs_data(
-	xfs_mount_t		*mp,
-	xfs_growfs_data_t	*in)
+	struct xfs_mount	*mp,
+	struct xfs_growfs_data	*in)
 {
-	int error;
+	int			error = 0;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 	if (!mutex_trylock(&mp->m_growlock))
 		return -EWOULDBLOCK;
+
+	/* update imaxpct separately from the physical grow of the filesystem */
+	if (in->imaxpct != mp->m_sb.sb_imax_pct) {
+		error = xfs_growfs_imaxpct(mp, in->imaxpct);
+		if (error)
+			goto out_error;
+	}
+
 	error = xfs_growfs_data_private(mp, in);
+	if (error)
+		goto out_error;
+
+	/*
+	 * Post growfs calculations needed to reflect new state in operations
+	 */
+	if (mp->m_sb.sb_imax_pct) {
+		uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;
+		do_div(icount, 100);
+		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
+	} else
+		mp->m_maxicount = 0;
+
+out_error:
 	/*
 	 * Increment the generation unconditionally, the error could be from
 	 * updating the secondary superblocks, in which case the new size
-- 
2.15.0.rc0



* [PATCH 06/14] xfs: separate secondary sb update in growfs
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (4 preceding siblings ...)
  2017-10-26  8:33 ` [PATCH 05/14] xfs: make imaxpct changes in growfs separate Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26  8:33 ` [PATCH 07/14] xfs: rework secondary superblock updates " Dave Chinner
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The secondary superblock update happens after all the transactions
that update the primary superblock have been committed, and its
errors need to be handled slightly differently. Separate the code out
into its own function, and clean up the error goto stack in the core
growfs code as it is now much simpler.
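
With this and the previous patch applied, the top level growfs path
ends up looking roughly like the sketch below (locking, capability
checks and the generation bump are elided; see the hunks that follow
for the real thing):

int
xfs_growfs_data(
	struct xfs_mount	*mp,
	struct xfs_growfs_data	*in)
{
	xfs_agnumber_t		oagcount = mp->m_sb.sb_agcount;
	int			error = 0;

	/* imaxpct change commits in its own transaction (previous patch) */
	if (in->imaxpct != mp->m_sb.sb_imax_pct)
		error = xfs_growfs_imaxpct(mp, in->imaxpct);

	/* physical grow: new AG headers, then the primary sb transaction */
	if (!error)
		error = xfs_growfs_data_private(mp, in);

	/* recompute m_maxicount etc, then rewrite the secondary superblocks */
	if (!error)
		error = xfs_growfs_update_superblocks(mp, oagcount);

	return error;
}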

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_fsops.c | 152 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 86 insertions(+), 66 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 9f8877388645..a4da710521f5 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -478,9 +478,8 @@ xfs_growfs_data_private(
 {
 	xfs_agf_t		*agf;
 	xfs_agi_t		*agi;
-	xfs_agnumber_t		agno;
 	xfs_buf_t		*bp;
-	int			error, saved_error = 0;
+	int			error;
 	xfs_agnumber_t		nagcount;
 	xfs_agnumber_t		nagimax = 0;
 	xfs_rfsblock_t		nb, nb_mod;
@@ -552,12 +551,12 @@ xfs_growfs_data_private(
 		error = xfs_grow_ag_headers(mp, &id);
 		if (error) {
 			xfs_buf_delwri_cancel(&id.buffer_list);
-			goto error0;
+			goto out_trans_cancel;
 		}
 	}
 	error = xfs_buf_delwri_submit(&id.buffer_list);
 	if (error)
-		goto error0;
+		goto out_trans_cancel;
 
 	xfs_trans_agblocks_delta(tp, id.nfree);
 
@@ -571,22 +570,23 @@ xfs_growfs_data_private(
 		 * Change the agi length.
 		 */
 		error = xfs_ialloc_read_agi(mp, tp, id.agno, &bp);
-		if (error) {
-			goto error0;
-		}
+		if (error)
+			goto out_trans_cancel;
+
 		ASSERT(bp);
 		agi = XFS_BUF_TO_AGI(bp);
 		be32_add_cpu(&agi->agi_length, new);
 		ASSERT(nagcount == oagcount ||
 		       be32_to_cpu(agi->agi_length) == mp->m_sb.sb_agblocks);
 		xfs_ialloc_log_agi(tp, bp, XFS_AGI_LENGTH);
+
 		/*
 		 * Change agf length.
 		 */
 		error = xfs_alloc_read_agf(mp, tp, id.agno, 0, &bp);
-		if (error) {
-			goto error0;
-		}
+		if (error)
+			goto out_trans_cancel;
+
 		ASSERT(bp);
 		agf = XFS_BUF_TO_AGF(bp);
 		be32_add_cpu(&agf->agf_length, new);
@@ -607,7 +607,7 @@ xfs_growfs_data_private(
 					be32_to_cpu(agf->agf_length) - new),
 				new, &oinfo, XFS_AG_RESV_NONE);
 		if (error)
-			goto error0;
+			goto out_trans_cancel;
 	}
 
 	/*
@@ -644,16 +644,79 @@ xfs_growfs_data_private(
 		error = xfs_ag_resv_free(pag);
 		xfs_perag_put(pag);
 		if (error)
-			goto out;
+			return error;
 	}
 
 	/* Reserve AG metadata blocks. */
-	error = xfs_fs_reserve_ag_blocks(mp);
-	if (error && error != -ENOSPC)
-		goto out;
+	return xfs_fs_reserve_ag_blocks(mp);
+
+out_trans_cancel:
+	xfs_trans_cancel(tp);
+	return error;
+}
+
+static int
+xfs_growfs_log_private(
+	xfs_mount_t		*mp,	/* mount point for filesystem */
+	xfs_growfs_log_t	*in)	/* growfs log input struct */
+{
+	xfs_extlen_t		nb;
+
+	nb = in->newblocks;
+	if (nb < XFS_MIN_LOG_BLOCKS || nb < XFS_B_TO_FSB(mp, XFS_MIN_LOG_BYTES))
+		return -EINVAL;
+	if (nb == mp->m_sb.sb_logblocks &&
+	    in->isint == (mp->m_sb.sb_logstart != 0))
+		return -EINVAL;
+	/*
+	 * Moving the log is hard, need new interfaces to sync
+	 * the log first, hold off all activity while moving it.
+	 * Can have shorter or longer log in the same space,
+	 * or transform internal to external log or vice versa.
+	 */
+	return -ENOSYS;
+}
+
+static int
+xfs_growfs_imaxpct(
+	struct xfs_mount	*mp,
+	__u32			imaxpct)
+{
+	struct xfs_trans	*tp;
+	int			dpct;
+	int			error;
+
+	if (imaxpct > 100)
+		return -EINVAL;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata,
+			XFS_GROWFS_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	dpct = imaxpct - mp->m_sb.sb_imax_pct;
+	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
+	xfs_trans_set_sync(tp);
+	return xfs_trans_commit(tp);
+}
+
+/*
+ * After a grow operation, we need to update all the secondary superblocks
+ * to match the new state of the primary. Read/init the superblocks and update
+ * them appropriately.
+ */
+static int
+xfs_growfs_update_superblocks(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		oagcount)
+{
+	struct xfs_buf		*bp;
+	xfs_agnumber_t		agno;
+	int			saved_error = 0;
+	int			error = 0;
 
 	/* update secondary superblocks. */
-	for (agno = 1; agno < nagcount; agno++) {
+	for (agno = 1; agno < mp->m_sb.sb_agcount; agno++) {
 		error = 0;
 		/*
 		 * new secondary superblocks need to be zeroed, not read from
@@ -703,57 +766,7 @@ xfs_growfs_data_private(
 		}
 	}
 
- out:
 	return saved_error ? saved_error : error;
-
- error0:
-	xfs_trans_cancel(tp);
-	return error;
-}
-
-static int
-xfs_growfs_log_private(
-	xfs_mount_t		*mp,	/* mount point for filesystem */
-	xfs_growfs_log_t	*in)	/* growfs log input struct */
-{
-	xfs_extlen_t		nb;
-
-	nb = in->newblocks;
-	if (nb < XFS_MIN_LOG_BLOCKS || nb < XFS_B_TO_FSB(mp, XFS_MIN_LOG_BYTES))
-		return -EINVAL;
-	if (nb == mp->m_sb.sb_logblocks &&
-	    in->isint == (mp->m_sb.sb_logstart != 0))
-		return -EINVAL;
-	/*
-	 * Moving the log is hard, need new interfaces to sync
-	 * the log first, hold off all activity while moving it.
-	 * Can have shorter or longer log in the same space,
-	 * or transform internal to external log or vice versa.
-	 */
-	return -ENOSYS;
-}
-
-static int
-xfs_growfs_imaxpct(
-	struct xfs_mount	*mp,
-	__u32			imaxpct)
-{
-	struct xfs_trans	*tp;
-	int64_t			dpct;
-	int			error;
-
-	if (imaxpct > 100)
-		return -EINVAL;
-
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata,
-			XFS_GROWFS_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp);
-	if (error)
-		return error;
-
-	dpct = (int64_t)imaxpct - mp->m_sb.sb_imax_pct;
-	xfs_trans_mod_sb(tp, XFS_TRANS_SB_IMAXPCT, dpct);
-	xfs_trans_set_sync(tp);
-	return xfs_trans_commit(tp);
 }
 
 /*
@@ -766,6 +779,7 @@ xfs_growfs_data(
 	struct xfs_mount	*mp,
 	struct xfs_growfs_data	*in)
 {
+	xfs_agnumber_t		oagcount;
 	int			error = 0;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -780,6 +794,7 @@ xfs_growfs_data(
 			goto out_error;
 	}
 
+	oagcount = mp->m_sb.sb_agcount;
 	error = xfs_growfs_data_private(mp, in);
 	if (error)
 		goto out_error;
@@ -794,6 +809,11 @@ xfs_growfs_data(
 	} else
 		mp->m_maxicount = 0;
 
+	/*
+	 * Update secondary superblocks now the physical grow has completed
+	 */
+	error = xfs_growfs_update_superblocks(mp, oagcount);
+
 out_error:
 	/*
 	 * Increment the generation unconditionally, the error could be from
-- 
2.15.0.rc0



* [PATCH 07/14] xfs: rework secondary superblock updates in growfs
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (5 preceding siblings ...)
  2017-10-26  8:33 ` [PATCH 06/14] xfs: separate secondary sb update in growfs Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26  8:33 ` [PATCH 08/14] xfs: move various type verifiers to common file Dave Chinner
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Right now we wait until we've committed changes to the primary
superblock before we initialise any of the new secondary
superblocks. This means that if we have any write errors for new
secondary superblocks we end up with garbage in place rather than
zeros or even an "in progress" superblock to indicate a grow
operation is being done.

To ensure we can write the secondary superblocks, initialise them
earlier in the same loop that initialises the AG headers. We stamp
the new secondary superblocks here with the old geometry, but set
the "sb_inprogress" field to indicate that updates are being done to
the superblock so they cannot be used.  This means the secondary
superblocks are either written successfully and later updated with
the new geometry, or they trigger write errors that abort the grow
before we commit any permanent changes.

This also means we can change the update mechanism of the secondary
superblocks.  We know that we are going to wholly overwrite the
information in the struct xfs_sb in the buffer, so there's no point
reading it from disk. Just allocate an uncached buffer, zero it in
memory, stamp the new superblock structure in it and write it out.
If we fail to write it out, then we'll leave the existing sb (old or
new w/ inprogress) on disk for repair to deal with later.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_fsops.c | 92 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 55 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index a4da710521f5..34c9fc257c2f 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -274,6 +274,25 @@ xfs_rmaproot_init(
 	}
 }
 
+/*
+ * Initialise new secondary superblocks with the pre-grow geometry, but mark
+ * them as "in progress" so we know they haven't yet been activated. This will
+ * get cleared when the update with the new geometry information is done after
+ * changes to the primary are committed. This isn't strictly necessary, but we
+ * get it for free with the delayed buffer write lists and it means we can tell
+ * if a grow operation didn't complete properly after the fact.
+ */
+static void
+xfs_sbblock_init(
+	struct xfs_mount	*mp,
+	struct xfs_buf		*bp,
+	struct aghdr_init_data	*id)
+{
+	struct xfs_dsb		*dsb = XFS_BUF_TO_SBP(bp);
+
+	xfs_sb_to_disk(dsb, &mp->m_sb);
+	dsb->sb_inprogress = 1;
+}
 
 static void
 xfs_agfblock_init(
@@ -410,6 +429,10 @@ xfs_grow_ag_headers(
 
 {
 	struct xfs_aghdr_grow_data aghdr_data[] = {
+		/* SB */
+		{ XFS_AG_DADDR(mp, id->agno, XFS_SB_DADDR),
+		  XFS_FSS_TO_BB(mp, 1), &xfs_sb_buf_ops,
+		  &xfs_sbblock_init, 0, 0, true },
 		/* AGF */
 		{ XFS_AG_DADDR(mp, id->agno, XFS_AGF_DADDR(mp)),
 		  XFS_FSS_TO_BB(mp, 1), &xfs_agf_buf_ops,
@@ -702,43 +725,27 @@ xfs_growfs_imaxpct(
 
 /*
  * After a grow operation, we need to update all the secondary superblocks
- * to match the new state of the primary. Read/init the superblocks and update
- * them appropriately.
+ * to match the new state of the primary. Because we are completely overwriting
+ * all the existing fields in the secondary superblock buffers, there is no need
+ * to read them in from disk. Just get a new uncached buffer, stamp it and
+ * write it.
  */
 static int
 xfs_growfs_update_superblocks(
-	struct xfs_mount	*mp,
-	xfs_agnumber_t		oagcount)
+	struct xfs_mount	*mp)
 {
-	struct xfs_buf		*bp;
 	xfs_agnumber_t		agno;
 	int			saved_error = 0;
 	int			error = 0;
+	LIST_HEAD		(buffer_list);
 
 	/* update secondary superblocks. */
 	for (agno = 1; agno < mp->m_sb.sb_agcount; agno++) {
-		error = 0;
-		/*
-		 * new secondary superblocks need to be zeroed, not read from
-		 * disk as the contents of the new area we are growing into is
-		 * completely unknown.
-		 */
-		if (agno < oagcount) {
-			error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp,
-				  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
-				  XFS_FSS_TO_BB(mp, 1), 0, &bp,
-				  &xfs_sb_buf_ops);
-		} else {
-			bp = xfs_trans_get_buf(NULL, mp->m_ddev_targp,
-				  XFS_AGB_TO_DADDR(mp, agno, XFS_SB_BLOCK(mp)),
-				  XFS_FSS_TO_BB(mp, 1), 0);
-			if (bp) {
-				bp->b_ops = &xfs_sb_buf_ops;
-				xfs_buf_zero(bp, 0, BBTOB(bp->b_length));
-			} else
-				error = -ENOMEM;
-		}
+		struct xfs_buf		*bp;
 
+		bp = xfs_growfs_get_hdr_buf(mp,
+				XFS_AG_DADDR(mp, agno, XFS_SB_DADDR),
+				XFS_FSS_TO_BB(mp, 1), 0, &xfs_sb_buf_ops);
 		/*
 		 * If we get an error reading or writing alternate superblocks,
 		 * continue.  xfs_repair chooses the "best" superblock based
@@ -746,25 +753,38 @@ xfs_growfs_update_superblocks(
 		 * superblocks un-updated than updated, and xfs_repair may
 		 * pick them over the properly-updated primary.
 		 */
-		if (error) {
+		if (!bp) {
 			xfs_warn(mp,
-		"error %d reading secondary superblock for ag %d",
-				error, agno);
-			saved_error = error;
+		"error allocating secondary superblock for ag %d",
+				agno);
+			if (!saved_error)
+				saved_error = -ENOMEM;
 			continue;
 		}
 		xfs_sb_to_disk(XFS_BUF_TO_SBP(bp), &mp->m_sb);
-
-		error = xfs_bwrite(bp);
+		xfs_buf_delwri_queue(bp, &buffer_list);
 		xfs_buf_relse(bp);
+
+		/* don't hold too many buffers at once */
+		if (agno % 16)
+			continue;
+
+		error = xfs_buf_delwri_submit(&buffer_list);
 		if (error) {
 			xfs_warn(mp,
-		"write error %d updating secondary superblock for ag %d",
+		"write error %d updating a secondary superblock near ag %d",
 				error, agno);
-			saved_error = error;
+			if (!saved_error)
+				saved_error = error;
 			continue;
 		}
 	}
+	error = xfs_buf_delwri_submit(&buffer_list);
+	if (error) {
+		xfs_warn(mp,
+		"write error %d updating a secondary superblock near ag %d",
+			error, agno);
+	}
 
 	return saved_error ? saved_error : error;
 }
@@ -779,7 +799,6 @@ xfs_growfs_data(
 	struct xfs_mount	*mp,
 	struct xfs_growfs_data	*in)
 {
-	xfs_agnumber_t		oagcount;
 	int			error = 0;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -794,7 +813,6 @@ xfs_growfs_data(
 			goto out_error;
 	}
 
-	oagcount = mp->m_sb.sb_agcount;
 	error = xfs_growfs_data_private(mp, in);
 	if (error)
 		goto out_error;
@@ -812,7 +830,7 @@ xfs_growfs_data(
 	/*
 	 * Update secondary superblocks now the physical grow has completed
 	 */
-	error = xfs_growfs_update_superblocks(mp, oagcount);
+	error = xfs_growfs_update_superblocks(mp);
 
 out_error:
 	/*
-- 
2.15.0.rc0



* [PATCH 08/14] xfs: move various type verifiers to common file
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (6 preceding siblings ...)
  2017-10-26  8:33 ` [PATCH 07/14] xfs: rework secondary superblock updates " Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26  8:33 ` [PATCH 09/14] xfs: split usable space from block device size Dave Chinner
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

New verification functions like xfs_verify_fsbno() and
xfs_verify_agino() are spread across multiple files and different
header files. They really don't fit cleanly into the places they've
been put, and have wider scope than the current header includes.

Move the type verifiers to a new file in libxfs (xfs_types.c) and
the prototypes to xfs_types.h where they will be visible to all the
code that uses the types.
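
The point of the move is that any code holding a raw on-disk value
can sanity check it with nothing more than xfs_types.h in scope. An
illustrative caller (a sketch, not a hunk from this patch) might look
like:

/* illustrative helper - reject on-disk pointers before trusting them */
static int
check_mapped_extent(
	struct xfs_mount	*mp,
	xfs_fsblock_t		fsbno,
	xfs_ino_t		owner)
{
	if (!xfs_verify_fsbno(mp, fsbno))
		return -EFSCORRUPTED;
	if (!xfs_verify_ino(mp, owner))
		return -EFSCORRUPTED;
	return 0;
}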

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/Makefile            |   1 +
 fs/xfs/libxfs/xfs_alloc.c  |  49 -------------
 fs/xfs/libxfs/xfs_alloc.h  |   4 --
 fs/xfs/libxfs/xfs_ialloc.c |  89 ------------------------
 fs/xfs/libxfs/xfs_ialloc.h |   7 --
 fs/xfs/libxfs/xfs_types.c  | 169 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_types.h  |  23 ++++++
 fs/xfs/scrub/agheader.c    |   2 +-
 fs/xfs/xfs_rtalloc.h       |   2 -
 9 files changed, 194 insertions(+), 152 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_types.c

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 0182fb816f6e..33d404d4a013 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -60,6 +60,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_sb.o \
 				   xfs_symlink_remote.o \
 				   xfs_trans_resv.o \
+				   xfs_types.o \
 				   )
 # xfs_rtbitmap is shared with libxfs
 xfs-$(CONFIG_XFS_RT)		+= $(addprefix libxfs/, \
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index b13b0db21fa1..8eb5dd1d7588 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2950,55 +2950,6 @@ xfs_alloc_query_all(
 	return xfs_btree_query_all(cur, xfs_alloc_query_range_helper, &query);
 }
 
-/* Find the size of the AG, in blocks. */
-xfs_agblock_t
-xfs_ag_block_count(
-	struct xfs_mount	*mp,
-	xfs_agnumber_t		agno)
-{
-	ASSERT(agno < mp->m_sb.sb_agcount);
-
-	if (agno < mp->m_sb.sb_agcount - 1)
-		return mp->m_sb.sb_agblocks;
-	return mp->m_sb.sb_dblocks - (agno * mp->m_sb.sb_agblocks);
-}
-
-/*
- * Verify that an AG block number pointer neither points outside the AG
- * nor points at static metadata.
- */
-bool
-xfs_verify_agbno(
-	struct xfs_mount	*mp,
-	xfs_agnumber_t		agno,
-	xfs_agblock_t		agbno)
-{
-	xfs_agblock_t		eoag;
-
-	eoag = xfs_ag_block_count(mp, agno);
-	if (agbno >= eoag)
-		return false;
-	if (agbno <= XFS_AGFL_BLOCK(mp))
-		return false;
-	return true;
-}
-
-/*
- * Verify that an FS block number pointer neither points outside the
- * filesystem nor points at static AG metadata.
- */
-bool
-xfs_verify_fsbno(
-	struct xfs_mount	*mp,
-	xfs_fsblock_t		fsbno)
-{
-	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
-
-	if (agno >= mp->m_sb.sb_agcount)
-		return false;
-	return xfs_verify_agbno(mp, agno, XFS_FSB_TO_AGBNO(mp, fsbno));
-}
-
 /* Is there a record covering a given extent? */
 int
 xfs_alloc_has_record(
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 65a0cafe06e4..a31424d4a5eb 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -239,10 +239,6 @@ int xfs_alloc_query_range(struct xfs_btree_cur *cur,
 		xfs_alloc_query_range_fn fn, void *priv);
 int xfs_alloc_query_all(struct xfs_btree_cur *cur, xfs_alloc_query_range_fn fn,
 		void *priv);
-xfs_agblock_t xfs_ag_block_count(struct xfs_mount *mp, xfs_agnumber_t agno);
-bool xfs_verify_agbno(struct xfs_mount *mp, xfs_agnumber_t agno,
-		xfs_agblock_t agbno);
-bool xfs_verify_fsbno(struct xfs_mount *mp, xfs_fsblock_t fsbno);
 
 int xfs_alloc_has_record(struct xfs_btree_cur *cur, xfs_agblock_t bno,
 		xfs_extlen_t len, bool *exist);
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index e64ae27f28b3..132b8c7af263 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2669,95 +2669,6 @@ xfs_ialloc_pagi_init(
 	return 0;
 }
 
-/* Calculate the first and last possible inode number in an AG. */
-void
-xfs_ialloc_agino_range(
-	struct xfs_mount	*mp,
-	xfs_agnumber_t		agno,
-	xfs_agino_t		*first,
-	xfs_agino_t		*last)
-{
-	xfs_agblock_t		bno;
-	xfs_agblock_t		eoag;
-
-	eoag = xfs_ag_block_count(mp, agno);
-
-	/*
-	 * Calculate the first inode, which will be in the first
-	 * cluster-aligned block after the AGFL.
-	 */
-	bno = round_up(XFS_AGFL_BLOCK(mp) + 1,
-			xfs_ialloc_cluster_alignment(mp));
-	*first = XFS_OFFBNO_TO_AGINO(mp, bno, 0);
-
-	/*
-	 * Calculate the last inode, which will be at the end of the
-	 * last (aligned) cluster that can be allocated in the AG.
-	 */
-	bno = round_down(eoag, xfs_ialloc_cluster_alignment(mp));
-	*last = XFS_OFFBNO_TO_AGINO(mp, bno, 0) - 1;
-}
-
-/*
- * Verify that an AG inode number pointer neither points outside the AG
- * nor points at static metadata.
- */
-bool
-xfs_verify_agino(
-	struct xfs_mount	*mp,
-	xfs_agnumber_t		agno,
-	xfs_agino_t		agino)
-{
-	xfs_agino_t		first;
-	xfs_agino_t		last;
-
-	xfs_ialloc_agino_range(mp, agno, &first, &last);
-	return agino >= first && agino <= last;
-}
-
-/*
- * Verify that an FS inode number pointer neither points outside the
- * filesystem nor points at static AG metadata.
- */
-bool
-xfs_verify_ino(
-	struct xfs_mount	*mp,
-	xfs_ino_t		ino)
-{
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ino);
-	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ino);
-
-	if (agno >= mp->m_sb.sb_agcount)
-		return false;
-	if (XFS_AGINO_TO_INO(mp, agno, agino) != ino)
-		return false;
-	return xfs_verify_agino(mp, agno, agino);
-}
-
-/* Is this an internal inode number? */
-bool
-xfs_internal_inum(
-	struct xfs_mount	*mp,
-	xfs_ino_t		ino)
-{
-	return ino == mp->m_sb.sb_rbmino || ino == mp->m_sb.sb_rsumino ||
-		(xfs_sb_version_hasquota(&mp->m_sb) &&
-		 xfs_is_quota_inode(&mp->m_sb, ino));
-}
-
-/*
- * Verify that a directory entry's inode number doesn't point at an internal
- * inode, empty space, or static AG metadata.
- */
-bool
-xfs_verify_dir_ino(
-	struct xfs_mount	*mp,
-	xfs_ino_t		ino)
-{
-	if (xfs_internal_inum(mp, ino))
-		return false;
-	return xfs_verify_ino(mp, ino);
-}
 
 /* Is there an inode record covering a given range of inode numbers? */
 int
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index 0085620591da..42b8c34a3878 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -182,12 +182,5 @@ int xfs_inobt_insert_rec(struct xfs_btree_cur *cur, uint16_t holemask,
 		int *stat);
 
 int xfs_ialloc_cluster_alignment(struct xfs_mount *mp);
-void xfs_ialloc_agino_range(struct xfs_mount *mp, xfs_agnumber_t agno,
-		xfs_agino_t *first, xfs_agino_t *last);
-bool xfs_verify_agino(struct xfs_mount *mp, xfs_agnumber_t agno,
-		xfs_agino_t agino);
-bool xfs_verify_ino(struct xfs_mount *mp, xfs_ino_t ino);
-bool xfs_internal_inum(struct xfs_mount *mp, xfs_ino_t ino);
-bool xfs_verify_dir_ino(struct xfs_mount *mp, xfs_ino_t ino);
 
 #endif	/* __XFS_IALLOC_H__ */
diff --git a/fs/xfs/libxfs/xfs_types.c b/fs/xfs/libxfs/xfs_types.c
new file mode 100644
index 000000000000..16d2488797a1
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_types.c
@@ -0,0 +1,169 @@
+/*
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_shared.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+
+/* Find the size of the AG, in blocks. */
+xfs_agblock_t
+xfs_ag_block_count(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	ASSERT(agno < mp->m_sb.sb_agcount);
+
+	if (agno < mp->m_sb.sb_agcount - 1)
+		return mp->m_sb.sb_agblocks;
+	return mp->m_sb.sb_dblocks - (agno * mp->m_sb.sb_agblocks);
+}
+
+/*
+ * Verify that an AG block number pointer neither points outside the AG
+ * nor points at static metadata.
+ */
+bool
+xfs_verify_agbno(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		agbno)
+{
+	xfs_agblock_t		eoag;
+
+	eoag = xfs_ag_block_count(mp, agno);
+	if (agbno >= eoag)
+		return false;
+	if (agbno <= XFS_AGFL_BLOCK(mp))
+		return false;
+	return true;
+}
+
+/*
+ * Verify that an FS block number pointer neither points outside the
+ * filesystem nor points at static AG metadata.
+ */
+bool
+xfs_verify_fsbno(
+	struct xfs_mount	*mp,
+	xfs_fsblock_t		fsbno)
+{
+	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, fsbno);
+
+	if (agno >= mp->m_sb.sb_agcount)
+		return false;
+	return xfs_verify_agbno(mp, agno, XFS_FSB_TO_AGBNO(mp, fsbno));
+}
+
+/* Calculate the first and last possible inode number in an AG. */
+void
+xfs_agino_range(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agino_t		*first,
+	xfs_agino_t		*last)
+{
+	xfs_agblock_t		bno;
+	xfs_agblock_t		eoag;
+
+	eoag = xfs_ag_block_count(mp, agno);
+
+	/*
+	 * Calculate the first inode, which will be in the first
+	 * cluster-aligned block after the AGFL.
+	 */
+	bno = round_up(XFS_AGFL_BLOCK(mp) + 1,
+			xfs_ialloc_cluster_alignment(mp));
+	*first = XFS_OFFBNO_TO_AGINO(mp, bno, 0);
+
+	/*
+	 * Calculate the last inode, which will be at the end of the
+	 * last (aligned) cluster that can be allocated in the AG.
+	 */
+	bno = round_down(eoag, xfs_ialloc_cluster_alignment(mp));
+	*last = XFS_OFFBNO_TO_AGINO(mp, bno, 0) - 1;
+}
+
+/*
+ * Verify that an AG inode number pointer neither points outside the AG
+ * nor points at static metadata.
+ */
+bool
+xfs_verify_agino(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agino_t		agino)
+{
+	xfs_agino_t		first;
+	xfs_agino_t		last;
+
+	xfs_agino_range(mp, agno, &first, &last);
+	return agino >= first && agino <= last;
+}
+
+/*
+ * Verify that an FS inode number pointer neither points outside the
+ * filesystem nor points at static AG metadata.
+ */
+bool
+xfs_verify_ino(
+	struct xfs_mount	*mp,
+	xfs_ino_t		ino)
+{
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ino);
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ino);
+
+	if (agno >= mp->m_sb.sb_agcount)
+		return false;
+	if (XFS_AGINO_TO_INO(mp, agno, agino) != ino)
+		return false;
+	return xfs_verify_agino(mp, agno, agino);
+}
+
+/* Is this an internal inode number? */
+bool
+xfs_internal_inum(
+	struct xfs_mount	*mp,
+	xfs_ino_t		ino)
+{
+	return ino == mp->m_sb.sb_rbmino || ino == mp->m_sb.sb_rsumino ||
+		(xfs_sb_version_hasquota(&mp->m_sb) &&
+		 xfs_is_quota_inode(&mp->m_sb, ino));
+}
+
+/*
+ * Verify that a directory entry's inode number doesn't point at an internal
+ * inode, empty space, or static AG metadata.
+ */
+bool
+xfs_verify_dir_ino(
+	struct xfs_mount	*mp,
+	xfs_ino_t		ino)
+{
+	if (xfs_internal_inum(mp, ino))
+		return false;
+	return xfs_verify_ino(mp, ino);
+}
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 0220159bd463..a17070ef51b8 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -136,5 +136,28 @@ typedef uint32_t	xfs_dqid_t;
 #define	XFS_NBWORD	(1 << XFS_NBWORDLOG)
 #define	XFS_WORDMASK	((1 << XFS_WORDLOG) - 1)
 
+/*
+ * Type verifier functions
+ */
+struct xfs_mount;
+
+xfs_agblock_t xfs_ag_block_count(struct xfs_mount *mp, xfs_agnumber_t agno);
+bool xfs_verify_agbno(struct xfs_mount *mp, xfs_agnumber_t agno,
+		xfs_agblock_t agbno);
+bool xfs_verify_fsbno(struct xfs_mount *mp, xfs_fsblock_t fsbno);
+
+void xfs_agino_range(struct xfs_mount *mp, xfs_agnumber_t agno,
+		xfs_agino_t *first, xfs_agino_t *last);
+bool xfs_verify_agino(struct xfs_mount *mp, xfs_agnumber_t agno,
+		xfs_agino_t agino);
+bool xfs_verify_ino(struct xfs_mount *mp, xfs_ino_t ino);
+bool xfs_internal_inum(struct xfs_mount *mp, xfs_ino_t ino);
+bool xfs_verify_dir_ino(struct xfs_mount *mp, xfs_ino_t ino);
+
+#ifdef CONFIG_XFS_RT
+bool xfs_verify_rtbno(struct xfs_mount *mp, xfs_rtblock_t rtbno);
+#else
+#define xfs_verify_rtbno(m, r)			(false)
+#endif
 
 #endif	/* __XFS_TYPES_H__ */
diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index 608b44c4ced8..0f638ba311db 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -1349,7 +1349,7 @@ xfs_scrub_agi(
 	}
 
 	/* Check inode counters */
-	xfs_ialloc_agino_range(mp, agno, &first_agino, &last_agino);
+	xfs_agino_range(mp, agno, &first_agino, &last_agino);
 	icount = be32_to_cpu(agi->agi_count);
 	if (icount > last_agino - first_agino + 1 ||
 	    icount < be32_to_cpu(agi->agi_freecount))
diff --git a/fs/xfs/xfs_rtalloc.h b/fs/xfs/xfs_rtalloc.h
index 540d506675cf..4aa70448d904 100644
--- a/fs/xfs/xfs_rtalloc.h
+++ b/fs/xfs/xfs_rtalloc.h
@@ -138,7 +138,6 @@ int xfs_rtalloc_query_range(struct xfs_trans *tp,
 int xfs_rtalloc_query_all(struct xfs_trans *tp,
 			  xfs_rtalloc_query_range_fn fn,
 			  void *priv);
-bool xfs_verify_rtbno(struct xfs_mount *mp, xfs_rtblock_t rtbno);
 int xfs_rtalloc_extent_is_free(struct xfs_mount *mp, struct xfs_trans *tp,
 			       xfs_rtblock_t start, xfs_rtblock_t len,
 			       bool *is_free);
@@ -150,7 +149,6 @@ int xfs_rtalloc_extent_is_free(struct xfs_mount *mp, struct xfs_trans *tp,
 # define xfs_rtalloc_query_range(t,l,h,f,p)             (ENOSYS)
 # define xfs_rtalloc_query_all(t,f,p)                   (ENOSYS)
 # define xfs_rtbuf_get(m,t,b,i,p)                       (ENOSYS)
-# define xfs_verify_rtbno(m, r)			(false)
 # define xfs_rtalloc_extent_is_free(m,t,s,l,i)          (ENOSYS)
 static inline int		/* error */
 xfs_rtmount_init(
-- 
2.15.0.rc0



* [PATCH 09/14] xfs: split usable space from block device size
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (7 preceding siblings ...)
  2017-10-26  8:33 ` [PATCH 08/14] xfs: move various type verifiers to common file Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26  8:33 ` [PATCH 10/14] xfs: hide reserved metadata space from users Dave Chinner
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The superblock field sb_dblocks is used for two purposes: to define
the size of the block device we are operating on, and to define the
maximum usable space in the filesystem. Whilst this distinction might
look like splitting hairs, the separation of "block device size vs
usable space in the block device" was made a long time ago by thinly
provisioned devices.

That is, the size of address space presented to the filesystem does
not define the usable space in the block device, and one of the big
problems we have with thinly provisioned devices is that we can't
make this distinction at the filesystem level.

The first step to supporting thinly provisioned storage directly in
XFS is to fix this mismatch. To do this, we really need to abstract
the two different use cases away from the superblock configuration.
This patch adds two variables to the struct xfs_mount to do this:

	m_LBA_size
	m_usable_blocks

Both are initialised from sb_dblocks, and the rest of the code is
adjusted to use the appropriate variable. Where we are checking for
valid addresses or need to check against the geometry of the
filesystem, we use m_LBA_size (e.g. fsbno verification). Where we
are using sb_dblocks as an indication of the maximum number of
allocatable blocks in the filesystem, we use m_usable_blocks (e.g.
calculating low space thresholds).

This separation will now allow us to modify the on-disk format
to adjust the usable space separately from the size of the block
device the filesystem sits on without impacting any other code
or existing filesystem behaviour.
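
The rule of thumb the rest of the series relies on: address and
bounds checks go against m_LBA_size, space accounting goes against
m_usable_blocks. A couple of hypothetical helpers to make the split
concrete (a sketch, not code from this patch):

/* addressing: is this daddr inside the device the filesystem knows about? */
static bool
xfs_daddr_in_device(
	struct xfs_mount	*mp,
	xfs_daddr_t		daddr)
{
	return daddr >= 0 && daddr < XFS_FSB_TO_BB(mp, mp->m_LBA_size);
}

/* accounting: how much space can users actually consume? */
static uint64_t
xfs_usable_bytes(
	struct xfs_mount	*mp)
{
	return XFS_FSB_TO_B(mp, mp->m_usable_blocks);
}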

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c |  6 +++---
 fs/xfs/libxfs/xfs_sb.c     |  6 ++++++
 fs/xfs/libxfs/xfs_types.c  |  2 +-
 fs/xfs/xfs_bmap_item.c     |  7 +++----
 fs/xfs/xfs_buf.c           |  2 +-
 fs/xfs/xfs_discard.c       |  6 +++---
 fs/xfs/xfs_extfree_item.c  |  3 +--
 fs/xfs/xfs_fsmap.c         |  2 +-
 fs/xfs/xfs_fsops.c         | 17 ++++++++++-------
 fs/xfs/xfs_mount.c         | 17 +++++++++--------
 fs/xfs/xfs_mount.h         |  2 ++
 fs/xfs/xfs_refcount_item.c |  4 ++--
 fs/xfs/xfs_rmap_item.c     |  4 ++--
 fs/xfs/xfs_super.c         |  4 ++--
 14 files changed, 46 insertions(+), 36 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 132b8c7af263..f168339423b5 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2388,12 +2388,12 @@ xfs_imap(
 	 * driver.
 	 */
 	if ((imap->im_blkno + imap->im_len) >
-	    XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) {
+	    XFS_FSB_TO_BB(mp, mp->m_LBA_size)) {
 		xfs_alert(mp,
-	"%s: (im_blkno (0x%llx) + im_len (0x%llx)) > sb_dblocks (0x%llx)",
+	"%s: (im_blkno (0x%llx) + im_len (0x%llx)) > device size (0x%llx)",
 			__func__, (unsigned long long) imap->im_blkno,
 			(unsigned long long) imap->im_len,
-			XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks));
+			XFS_FSB_TO_BB(mp, mp->m_LBA_size));
 		return -EINVAL;
 	}
 	return 0;
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 9b49640a65d6..87b57abeace2 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -728,6 +728,12 @@ xfs_sb_mount_common(
 	mp->m_blockwsize = sbp->sb_blocksize >> XFS_WORDLOG;
 	mp->m_blockwmask = mp->m_blockwsize - 1;
 
+	/*
+	 * Set up the filesystem size and addressing limits
+	 */
+	mp->m_LBA_size = sbp->sb_dblocks;
+	mp->m_usable_blocks = sbp->sb_dblocks;
+
 	mp->m_alloc_mxr[0] = xfs_allocbt_maxrecs(mp, sbp->sb_blocksize, 1);
 	mp->m_alloc_mxr[1] = xfs_allocbt_maxrecs(mp, sbp->sb_blocksize, 0);
 	mp->m_alloc_mnr[0] = mp->m_alloc_mxr[0] / 2;
diff --git a/fs/xfs/libxfs/xfs_types.c b/fs/xfs/libxfs/xfs_types.c
index 16d2488797a1..092c032fee51 100644
--- a/fs/xfs/libxfs/xfs_types.c
+++ b/fs/xfs/libxfs/xfs_types.c
@@ -39,7 +39,7 @@ xfs_ag_block_count(
 
 	if (agno < mp->m_sb.sb_agcount - 1)
 		return mp->m_sb.sb_agblocks;
-	return mp->m_sb.sb_dblocks - (agno * mp->m_sb.sb_agblocks);
+	return mp->m_LBA_size - (agno * mp->m_sb.sb_agblocks);
 }
 
 /*
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index dd136f7275e4..ade97e8180b3 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -435,12 +435,11 @@ xfs_bui_recover(
 		op_ok = false;
 		break;
 	}
-	if (!op_ok || startblock_fsb == 0 ||
+	if (!op_ok ||
+	    !xfs_verify_fsbno(mp, startblock_fsb) ||
+	    !xfs_verify_fsbno(mp, inode_fsb) ||
 	    bmap->me_len == 0 ||
-	    inode_fsb == 0 ||
-	    startblock_fsb >= mp->m_sb.sb_dblocks ||
 	    bmap->me_len >= mp->m_sb.sb_agblocks ||
-	    inode_fsb >= mp->m_sb.sb_dblocks ||
 	    (bmap->me_flags & ~XFS_BMAP_EXTENT_FLAGS)) {
 		/*
 		 * This will pull the BUI from the AIL and
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 2f97c12ca75e..0b51922aeebc 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -575,7 +575,7 @@ _xfs_buf_find(
 	 * Corrupted block numbers can get through to here, unfortunately, so we
 	 * have to check that the buffer falls within the filesystem bounds.
 	 */
-	eofs = XFS_FSB_TO_BB(btp->bt_mount, btp->bt_mount->m_sb.sb_dblocks);
+	eofs = XFS_FSB_TO_BB(btp->bt_mount, btp->bt_mount->m_LBA_size);
 	if (cmap.bm_bn < 0 || cmap.bm_bn >= eofs) {
 		/*
 		 * XXX (dgc): we should really be returning -EFSCORRUPTED here,
diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c
index b2cde5426182..be247d61961f 100644
--- a/fs/xfs/xfs_discard.c
+++ b/fs/xfs/xfs_discard.c
@@ -183,7 +183,7 @@ xfs_ioc_trim(
 	 * used by the fstrim application.  In the end it really doesn't
 	 * matter as trimming blocks is an advisory interface.
 	 */
-	if (range.start >= XFS_FSB_TO_B(mp, mp->m_sb.sb_dblocks) ||
+	if (range.start >= XFS_FSB_TO_B(mp, mp->m_LBA_size) ||
 	    range.minlen > XFS_FSB_TO_B(mp, mp->m_ag_max_usable) ||
 	    range.len < mp->m_sb.sb_blocksize)
 		return -EINVAL;
@@ -192,8 +192,8 @@ xfs_ioc_trim(
 	end = start + BTOBBT(range.len) - 1;
 	minlen = BTOBB(max_t(u64, granularity, range.minlen));
 
-	if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1)
-		end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)- 1;
+	if (end > XFS_FSB_TO_BB(mp, mp->m_LBA_size) - 1)
+		end = XFS_FSB_TO_BB(mp, mp->m_LBA_size)- 1;
 
 	start_agno = xfs_daddr_to_agno(mp, start);
 	end_agno = xfs_daddr_to_agno(mp, end);
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 44f8c5451210..c6e5ff779199 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -519,9 +519,8 @@ xfs_efi_recover(
 		extp = &efip->efi_format.efi_extents[i];
 		startblock_fsb = XFS_BB_TO_FSB(mp,
 				   XFS_FSB_TO_DADDR(mp, extp->ext_start));
-		if (startblock_fsb == 0 ||
+		if (!xfs_verify_fsbno(mp, startblock_fsb) ||
 		    extp->ext_len == 0 ||
-		    startblock_fsb >= mp->m_sb.sb_dblocks ||
 		    extp->ext_len >= mp->m_sb.sb_agblocks) {
 			/*
 			 * This will pull the EFI from the AIL and
diff --git a/fs/xfs/xfs_fsmap.c b/fs/xfs/xfs_fsmap.c
index 43cfc07996a4..e37c26a5d534 100644
--- a/fs/xfs/xfs_fsmap.c
+++ b/fs/xfs/xfs_fsmap.c
@@ -585,7 +585,7 @@ __xfs_getfsmap_datadev(
 	xfs_daddr_t			eofs;
 	int				error = 0;
 
-	eofs = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
+	eofs = XFS_FSB_TO_BB(mp, mp->m_LBA_size);
 	if (keys[0].fmr_physical >= eofs)
 		return 0;
 	if (keys[1].fmr_physical >= eofs)
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 34c9fc257c2f..f33c74f2e925 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -66,7 +66,7 @@ xfs_fs_geometry(
 	geo->sectsize = mp->m_sb.sb_sectsize;
 	geo->inodesize = mp->m_sb.sb_inodesize;
 	geo->imaxpct = mp->m_sb.sb_imax_pct;
-	geo->datablocks = mp->m_sb.sb_dblocks;
+	geo->datablocks = mp->m_LBA_size;
 	geo->rtblocks = mp->m_sb.sb_rblocks;
 	geo->rtextents = mp->m_sb.sb_rextents;
 	geo->logstart = mp->m_sb.sb_logstart;
@@ -513,7 +513,7 @@ xfs_growfs_data_private(
 	struct aghdr_init_data	id = {};
 
 	nb = in->newblocks;
-	if (nb < mp->m_sb.sb_dblocks)
+	if (nb < mp->m_LBA_size)
 		return -EINVAL;
 	if ((error = xfs_sb_validate_fsb_count(&mp->m_sb, nb)))
 		return error;
@@ -530,10 +530,10 @@ xfs_growfs_data_private(
 	if (nb_mod && nb_mod < XFS_MIN_AG_BLOCKS) {
 		nagcount--;
 		nb = (xfs_rfsblock_t)nagcount * mp->m_sb.sb_agblocks;
-		if (nb < mp->m_sb.sb_dblocks)
+		if (nb < mp->m_LBA_size)
 			return -EINVAL;
 	}
-	new = nb - mp->m_sb.sb_dblocks;
+	new = nb - mp->m_LBA_size;
 	oagcount = mp->m_sb.sb_agcount;
 
 	/* allocate the new per-ag structures */
@@ -640,9 +640,9 @@ xfs_growfs_data_private(
 	 */
 	if (nagcount > oagcount)
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
-	if (nb > mp->m_sb.sb_dblocks)
+	if (nb > mp->m_LBA_size)
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS,
-				 nb - mp->m_sb.sb_dblocks);
+				 nb - mp->m_LBA_size);
 	if (id.nfree)
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
 	xfs_trans_set_sync(tp);
@@ -651,6 +651,9 @@ xfs_growfs_data_private(
 		return error;
 
 	/* New allocation groups fully initialized, so update mount struct */
+	mp->m_usable_blocks = mp->m_sb.sb_dblocks;
+	mp->m_LBA_size = mp->m_sb.sb_dblocks;
+
 	if (nagimax)
 		mp->m_maxagi = nagimax;
 	xfs_set_low_space_thresholds(mp);
@@ -821,7 +824,7 @@ xfs_growfs_data(
 	 * Post growfs calculations needed to reflect new state in operations
 	 */
 	if (mp->m_sb.sb_imax_pct) {
-		uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;
+		uint64_t icount = mp->m_usable_blocks * mp->m_sb.sb_imax_pct;
 		do_div(icount, 100);
 		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
 	} else
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index e9727d0a541a..a9874d9dcf3d 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -432,17 +432,18 @@ xfs_update_alignment(xfs_mount_t *mp)
  * Set the maximum inode count for this filesystem
  */
 STATIC void
-xfs_set_maxicount(xfs_mount_t *mp)
+xfs_set_maxicount(
+	struct xfs_mount	*mp)
 {
-	xfs_sb_t	*sbp = &(mp->m_sb);
-	uint64_t	icount;
+	struct xfs_sb		*sbp = &mp->m_sb;
+	uint64_t		icount;
 
 	if (sbp->sb_imax_pct) {
 		/*
 		 * Make sure the maximum inode count is a multiple
 		 * of the units we allocate inodes in.
 		 */
-		icount = sbp->sb_dblocks * sbp->sb_imax_pct;
+		icount = mp->m_usable_blocks * sbp->sb_imax_pct;
 		do_div(icount, 100);
 		do_div(icount, mp->m_ialloc_blks);
 		mp->m_maxicount = (icount * mp->m_ialloc_blks)  <<
@@ -501,7 +502,7 @@ xfs_set_low_space_thresholds(
 	int i;
 
 	for (i = 0; i < XFS_LOWSP_MAX; i++) {
-		uint64_t space = mp->m_sb.sb_dblocks;
+		uint64_t space = mp->m_usable_blocks;
 
 		do_div(space, 100);
 		mp->m_low_space[i] = space * (i + 1);
@@ -542,8 +543,8 @@ xfs_check_sizes(
 	xfs_daddr_t	d;
 	int		error;
 
-	d = (xfs_daddr_t)XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks);
-	if (XFS_BB_TO_FSB(mp, d) != mp->m_sb.sb_dblocks) {
+	d = (xfs_daddr_t)XFS_FSB_TO_BB(mp, mp->m_LBA_size);
+	if (XFS_BB_TO_FSB(mp, d) != mp->m_LBA_size) {
 		xfs_warn(mp, "filesystem size mismatch detected");
 		return -EFBIG;
 	}
@@ -609,7 +610,7 @@ xfs_default_resblks(xfs_mount_t *mp)
 	 * block reservation. Hence by default we cover roughly 2000 concurrent
 	 * allocation reservations.
 	 */
-	resblks = mp->m_sb.sb_dblocks;
+	resblks = mp->m_usable_blocks;
 	do_div(resblks, 20);
 	resblks = min_t(uint64_t, resblks, 8192);
 	return resblks;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 37a6c97af394..1918a564bebf 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -79,6 +79,8 @@ typedef struct xfs_mount {
 	struct percpu_counter	m_icount;	/* allocated inodes counter */
 	struct percpu_counter	m_ifree;	/* free inodes counter */
 	struct percpu_counter	m_fdblocks;	/* free block counter */
+	xfs_rfsblock_t		m_LBA_size;	/* device address space size */
+	xfs_rfsblock_t		m_usable_blocks; /* max allocatable fs space */
 
 	struct xfs_buf		*m_sb_bp;	/* buffer for superblock */
 	char			*m_fsname;	/* filesystem name */
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 8f2e2fac4255..fefbc68ebde3 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -434,9 +434,9 @@ xfs_cui_recover(
 			op_ok = false;
 			break;
 		}
-		if (!op_ok || startblock_fsb == 0 ||
+		if (!op_ok ||
+		    !xfs_verify_fsbno(mp, startblock_fsb) ||
 		    refc->pe_len == 0 ||
-		    startblock_fsb >= mp->m_sb.sb_dblocks ||
 		    refc->pe_len >= mp->m_sb.sb_agblocks ||
 		    (refc->pe_flags & ~XFS_REFCOUNT_EXTENT_FLAGS)) {
 			/*
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index f3b139c9aa16..b83be5ceef14 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -455,9 +455,9 @@ xfs_rui_recover(
 			op_ok = false;
 			break;
 		}
-		if (!op_ok || startblock_fsb == 0 ||
+		if (!op_ok ||
+		    !xfs_verify_fsbno(mp, startblock_fsb) ||
 		    rmap->me_len == 0 ||
-		    startblock_fsb >= mp->m_sb.sb_dblocks ||
 		    rmap->me_len >= mp->m_sb.sb_agblocks ||
 		    (rmap->me_flags & ~XFS_RMAP_EXTENT_FLAGS)) {
 			/*
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 12198055c319..a4e8c313eef1 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -624,7 +624,7 @@ xfs_set_inode_alloc(
 	if (mp->m_maxicount) {
 		uint64_t	icount;
 
-		icount = sbp->sb_dblocks * sbp->sb_imax_pct;
+		icount = mp->m_usable_blocks * sbp->sb_imax_pct;
 		do_div(icount, 100);
 		icount += sbp->sb_agblocks - 1;
 		do_div(icount, sbp->sb_agblocks);
@@ -1126,7 +1126,7 @@ xfs_fs_statfs(
 	spin_lock(&mp->m_sb_lock);
 	statp->f_bsize = sbp->sb_blocksize;
 	lsize = sbp->sb_logstart ? sbp->sb_logblocks : 0;
-	statp->f_blocks = sbp->sb_dblocks - lsize;
+	statp->f_blocks = mp->m_usable_blocks - lsize;
 	spin_unlock(&mp->m_sb_lock);
 
 	statp->f_bfree = fdblocks - mp->m_alloc_set_aside;
-- 
2.15.0.rc0



* [PATCH 10/14] xfs: hide reserved metadata space from users
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (8 preceding siblings ...)
  2017-10-26  8:33 ` [PATCH 09/14] xfs: split usable space from block device size Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26  8:33 ` [PATCH 11/14] xfs: bump XFS_IOC_FSGEOMETRY to v5 structures Dave Chinner
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

One of the major user-visible issues with the new rmap and reflink
code is that it has to reserve worst-case allocation space for its
per-AG btrees. The reservation isn't small, either - it's ~2% of
total space - and so is very noticeable. Indeed, a freshly mkfs'd,
empty 500TB filesystem reports:

$ df -h /mnt/scratch
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        500T  9.6T  491T   2% /mnt/scratch

That 10TB of space has already been used before the user even writes
a byte of data to it. This space can never be used by the user, so
reporting it as free space is sub-optimal. At minimum, it's going to
cause issues with space usage reporting tools that admins use.

However, now that we can track usable space separately from the block
device size, we can make this space disappear from the filesystem
completely. That is, if the reservation also drops the maximum
usable space available to the filesystem, the above freshly made
filesystem on the 500TB device now reports as:

$ df -h /mnt/scratch
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        491T   62M  491T   1% /mnt/scratch

A 491TB filesystem with just 62MB used in it.

The filesystem no longer reports the reservation as used space and
the filesystem now reports exactly what the users and admins expect:
that it is completely empty.
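
The change is directly visible to anything using statfs(); a quick
illustrative check from userspace (not part of this patch):

#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
	struct statvfs st;

	if (argc < 2 || statvfs(argv[1], &st) < 0)
		return 1;

	/* f_blocks no longer includes the hidden per-AG reservations */
	printf("size %llu bytes, free %llu bytes\n",
	       (unsigned long long)st.f_blocks * st.f_frsize,
	       (unsigned long long)st.f_bfree * st.f_frsize);
	return 0;
}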

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ag_resv.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index df3e600835e8..475512d54a0a 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -143,6 +143,28 @@ xfs_ag_resv_needed(
 	return len;
 }
 
+/*
+ * xfs_ag_resv_update_space() hides the reservations from userspace by modifying
+ * the maximum usable space in the filesystem as well as the free blocks
+ * available.  This means a truly empty filesystem will report that it's truly
+ * empty rather than reporting that it has a significant amount of space
+ * in use that the user cannot free.
+ */
+static int
+xfs_ag_resv_update_space(
+	struct xfs_mount	*mp,
+	int64_t			resv)
+{
+	int			error;
+
+	error = xfs_mod_fdblocks(mp, resv, true);
+	if (error)
+		return error;
+
+	mp->m_usable_blocks += resv;
+	return 0;
+}
+
 /* Clean out a reservation */
 static int
 __xfs_ag_resv_free(
@@ -166,7 +188,8 @@ __xfs_ag_resv_free(
 		oldresv = resv->ar_orig_reserved;
 	else
 		oldresv = resv->ar_reserved;
-	error = xfs_mod_fdblocks(pag->pag_mount, oldresv, true);
+
+	error = xfs_ag_resv_update_space(pag->pag_mount, (int64_t)oldresv);
 	resv->ar_reserved = 0;
 	resv->ar_asked = 0;
 
@@ -207,7 +230,7 @@ __xfs_ag_resv_init(
 		ask = used;
 	reserved = ask - used;
 
-	error = xfs_mod_fdblocks(mp, -(int64_t)reserved, true);
+	error = xfs_ag_resv_update_space(pag->pag_mount, -(int64_t)reserved);
 	if (error) {
 		trace_xfs_ag_resv_init_error(pag->pag_mount, pag->pag_agno,
 				error, _RET_IP_);
-- 
2.15.0.rc0



* [PATCH 11/14] xfs: bump XFS_IOC_FSGEOMETRY to v5 structures
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (9 preceding siblings ...)
  2017-10-26  8:33 ` [PATCH 10/14] xfs: hide reserved metadata space from users Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26  8:33 ` [PATCH 12/14] xfs: convert remaining xfs_sb_version_... checks to bool Dave Chinner
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have separate device size and usable block information in
the kernel, we need to get that out to userspace so that tools like
xfs_info can report it sanely.

Unfortunately, the V4 XFS_IOC_FSGEOMETRY structure is out of space
so we can't just add a new field to it for this. Hence we need to
bump the definition to V5 and and treat the V4 ioctl and structure
similar to v1 to v3.

While doing this, clean up all the definitions associated with the
XFS_IOC_FSGEOMETRY ioctl.
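
For userspace the upshot is a new request with a larger structure,
with the old V4 request kept for compatibility. A sketch of a caller
that prefers V5 and falls back on older kernels (assumes headers
updated for this series, minimal error handling):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>

int print_datablocks(const char *mntpt)
{
	struct xfs_fsop_geom	geo = { 0 };
	struct xfs_fsop_geom_v4	geo4 = { 0 };
	int			fd, ret;

	fd = open(mntpt, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;

	/* Prefer the V5 geometry; older kernels only know the V4 request. */
	ret = ioctl(fd, XFS_IOC_FSGEOMETRY, &geo);
	if (ret == 0) {
		printf("datablocks: %llu\n",
		       (unsigned long long)geo.datablocks);
	} else {
		ret = ioctl(fd, XFS_IOC_FSGEOMETRY_V4, &geo4);
		if (ret == 0)
			printf("datablocks (v4): %llu\n",
			       (unsigned long long)geo4.datablocks);
	}
	close(fd);
	return ret;
}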

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_fs.h | 87 ++++++++++++++++++++++++++++++++++----------------
 fs/xfs/xfs_fsops.c     |  8 +++--
 fs/xfs/xfs_fsops.h     | 15 +++++----
 fs/xfs/xfs_ioctl.c     | 47 ++++++++++-----------------
 fs/xfs/xfs_ioctl32.c   |  3 +-
 5 files changed, 92 insertions(+), 68 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index b90924104596..223ec2695678 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -136,7 +136,7 @@ typedef struct xfs_flock64 {
 /*
  * Output for XFS_IOC_FSGEOMETRY_V1
  */
-typedef struct xfs_fsop_geom_v1 {
+struct xfs_fsop_geom_v1 {
 	__u32		blocksize;	/* filesystem (data) block size */
 	__u32		rtextsize;	/* realtime extent size		*/
 	__u32		agblocks;	/* fsblocks in an AG		*/
@@ -157,12 +157,39 @@ typedef struct xfs_fsop_geom_v1 {
 	__u32		logsectsize;	/* log sector size, bytes	*/
 	__u32		rtsectsize;	/* realtime sector size, bytes	*/
 	__u32		dirblocksize;	/* directory block size, bytes	*/
-} xfs_fsop_geom_v1_t;
+};
+
+/*
+ * Output for XFS_IOC_FSGEOMETRY_V4
+ */
+struct xfs_fsop_geom_v4 {
+	__u32		blocksize;	/* filesystem (data) block size */
+	__u32		rtextsize;	/* realtime extent size		*/
+	__u32		agblocks;	/* fsblocks in an AG		*/
+	__u32		agcount;	/* number of allocation groups	*/
+	__u32		logblocks;	/* fsblocks in the log		*/
+	__u32		sectsize;	/* (data) sector size, bytes	*/
+	__u32		inodesize;	/* inode size in bytes		*/
+	__u32		imaxpct;	/* max allowed inode space(%)	*/
+	__u64		datablocks;	/* fsblocks in data subvolume	*/
+	__u64		rtblocks;	/* fsblocks in realtime subvol	*/
+	__u64		rtextents;	/* rt extents in realtime subvol*/
+	__u64		logstart;	/* starting fsblock of the log	*/
+	unsigned char	uuid[16];	/* unique id of the filesystem	*/
+	__u32		sunit;		/* stripe unit, fsblocks	*/
+	__u32		swidth;		/* stripe width, fsblocks	*/
+	__s32		version;	/* structure version		*/
+	__u32		flags;		/* superblock version flags	*/
+	__u32		logsectsize;	/* log sector size, bytes	*/
+	__u32		rtsectsize;	/* realtime sector size, bytes	*/
+	__u32		dirblocksize;	/* directory block size, bytes	*/
+	__u32		logsunit;	/* log stripe unit, bytes	*/
+};
 
 /*
  * Output for XFS_IOC_FSGEOMETRY
  */
-typedef struct xfs_fsop_geom {
+struct xfs_fsop_geom {
 	__u32		blocksize;	/* filesystem (data) block size */
 	__u32		rtextsize;	/* realtime extent size		*/
 	__u32		agblocks;	/* fsblocks in an AG		*/
@@ -183,8 +210,9 @@ typedef struct xfs_fsop_geom {
 	__u32		logsectsize;	/* log sector size, bytes	*/
 	__u32		rtsectsize;	/* realtime sector size, bytes	*/
 	__u32		dirblocksize;	/* directory block size, bytes	*/
-	__u32		logsunit;	/* log stripe unit, bytes */
-} xfs_fsop_geom_t;
+	__u32		logsunit;	/* log stripe unit, bytes	*/
+	__u64		pad[16];	/* expansion space		*/
+};
 
 /* Output for XFS_FS_COUNTS */
 typedef struct xfs_fsop_counts {
@@ -200,28 +228,30 @@ typedef struct xfs_fsop_resblks {
 	__u64  resblks_avail;
 } xfs_fsop_resblks_t;
 
-#define XFS_FSOP_GEOM_VERSION	0
-
-#define XFS_FSOP_GEOM_FLAGS_ATTR	0x0001	/* attributes in use	*/
-#define XFS_FSOP_GEOM_FLAGS_NLINK	0x0002	/* 32-bit nlink values	*/
-#define XFS_FSOP_GEOM_FLAGS_QUOTA	0x0004	/* quotas enabled	*/
-#define XFS_FSOP_GEOM_FLAGS_IALIGN	0x0008	/* inode alignment	*/
-#define XFS_FSOP_GEOM_FLAGS_DALIGN	0x0010	/* large data alignment */
-#define XFS_FSOP_GEOM_FLAGS_SHARED	0x0020	/* read-only shared	*/
-#define XFS_FSOP_GEOM_FLAGS_EXTFLG	0x0040	/* special extent flag	*/
-#define XFS_FSOP_GEOM_FLAGS_DIRV2	0x0080	/* directory version 2	*/
-#define XFS_FSOP_GEOM_FLAGS_LOGV2	0x0100	/* log format version 2	*/
-#define XFS_FSOP_GEOM_FLAGS_SECTOR	0x0200	/* sector sizes >1BB	*/
-#define XFS_FSOP_GEOM_FLAGS_ATTR2	0x0400	/* inline attributes rework */
-#define XFS_FSOP_GEOM_FLAGS_PROJID32	0x0800	/* 32-bit project IDs	*/
-#define XFS_FSOP_GEOM_FLAGS_DIRV2CI	0x1000	/* ASCII only CI names	*/
-#define XFS_FSOP_GEOM_FLAGS_LAZYSB	0x4000	/* lazy superblock counters */
-#define XFS_FSOP_GEOM_FLAGS_V5SB	0x8000	/* version 5 superblock */
-#define XFS_FSOP_GEOM_FLAGS_FTYPE	0x10000	/* inode directory types */
-#define XFS_FSOP_GEOM_FLAGS_FINOBT	0x20000	/* free inode btree */
-#define XFS_FSOP_GEOM_FLAGS_SPINODES	0x40000	/* sparse inode chunks	*/
-#define XFS_FSOP_GEOM_FLAGS_RMAPBT	0x80000	/* reverse mapping btree */
-#define XFS_FSOP_GEOM_FLAGS_REFLINK	0x100000 /* files can share blocks */
+#define XFS_FSOP_GEOM_VERSION		0
+#define XFS_FSOP_GEOM_VERSION_V5	5
+
+#define XFS_FSOP_GEOM_FLAGS_ATTR	(1 << 0)  /* attributes in use	   */
+#define XFS_FSOP_GEOM_FLAGS_NLINK	(1 << 1)  /* 32-bit nlink values   */
+#define XFS_FSOP_GEOM_FLAGS_QUOTA	(1 << 2)  /* quotas enabled	   */
+#define XFS_FSOP_GEOM_FLAGS_IALIGN	(1 << 3)  /* inode alignment	   */
+#define XFS_FSOP_GEOM_FLAGS_DALIGN	(1 << 4)  /* large data alignment  */
+#define XFS_FSOP_GEOM_FLAGS_SHARED	(1 << 5)  /* read-only shared	   */
+#define XFS_FSOP_GEOM_FLAGS_EXTFLG	(1 << 6)  /* special extent flag   */
+#define XFS_FSOP_GEOM_FLAGS_DIRV2	(1 << 7)  /* directory version 2   */
+#define XFS_FSOP_GEOM_FLAGS_LOGV2	(1 << 8)  /* log format version 2  */
+#define XFS_FSOP_GEOM_FLAGS_SECTOR	(1 << 9)  /* sector sizes >1BB	   */
+#define XFS_FSOP_GEOM_FLAGS_ATTR2	(1 << 10) /* inline attributes rework */
+#define XFS_FSOP_GEOM_FLAGS_PROJID32	(1 << 11) /* 32-bit project IDs	   */
+#define XFS_FSOP_GEOM_FLAGS_DIRV2CI	(1 << 12) /* ASCII only CI names   */
+	/*  -- Do not use --		(1 << 13)    SGI parent pointers   */
+#define XFS_FSOP_GEOM_FLAGS_LAZYSB	(1 << 14) /* lazy superblock counters */
+#define XFS_FSOP_GEOM_FLAGS_V5SB	(1 << 15) /* version 5 superblock  */
+#define XFS_FSOP_GEOM_FLAGS_FTYPE	(1 << 16) /* inode directory types */
+#define XFS_FSOP_GEOM_FLAGS_FINOBT	(1 << 17) /* free inode btree	   */
+#define XFS_FSOP_GEOM_FLAGS_SPINODES	(1 << 18) /* sparse inode chunks   */
+#define XFS_FSOP_GEOM_FLAGS_RMAPBT	(1 << 19) /* reverse mapping btree */
+#define XFS_FSOP_GEOM_FLAGS_REFLINK	(1 << 20) /* files can share blocks */
 
 /*
  * Minimum and maximum sizes need for growth checks.
@@ -618,8 +648,9 @@ struct xfs_scrub_metadata {
 #define XFS_IOC_FSSETDM_BY_HANDLE    _IOW ('X', 121, struct xfs_fsop_setdm_handlereq)
 #define XFS_IOC_ATTRLIST_BY_HANDLE   _IOW ('X', 122, struct xfs_fsop_attrlist_handlereq)
 #define XFS_IOC_ATTRMULTI_BY_HANDLE  _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
-#define XFS_IOC_FSGEOMETRY	     _IOR ('X', 124, struct xfs_fsop_geom)
+#define XFS_IOC_FSGEOMETRY_V4	     _IOR ('X', 124, struct xfs_fsop_geom_v4)
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, uint32_t)
+#define XFS_IOC_FSGEOMETRY	     _IOR ('X', 126, struct xfs_fsop_geom)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index f33c74f2e925..b1659e535f7a 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -51,8 +51,8 @@
 
 int
 xfs_fs_geometry(
-	xfs_mount_t		*mp,
-	xfs_fsop_geom_t		*geo,
+	struct xfs_mount	*mp,
+	struct xfs_fsop_geom	*geo,
 	int			new_version)
 {
 
@@ -123,6 +123,10 @@ xfs_fs_geometry(
 				XFS_FSOP_GEOM_FLAGS_LOGV2 : 0);
 		geo->logsunit = mp->m_sb.sb_logsunit;
 	}
+
+	if (new_version >= 5)
+		geo->version = XFS_FSOP_GEOM_VERSION_V5;
+
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 2954c13a3acd..92d4fa2f2a2f 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -18,13 +18,14 @@
 #ifndef __XFS_FSOPS_H__
 #define	__XFS_FSOPS_H__
 
-extern int xfs_fs_geometry(xfs_mount_t *mp, xfs_fsop_geom_t *geo, int nversion);
-extern int xfs_growfs_data(xfs_mount_t *mp, xfs_growfs_data_t *in);
-extern int xfs_growfs_log(xfs_mount_t *mp, xfs_growfs_log_t *in);
-extern int xfs_fs_counts(xfs_mount_t *mp, xfs_fsop_counts_t *cnt);
-extern int xfs_reserve_blocks(xfs_mount_t *mp, uint64_t *inval,
-				xfs_fsop_resblks_t *outval);
-extern int xfs_fs_goingdown(xfs_mount_t *mp, uint32_t inflags);
+extern int xfs_fs_geometry(struct xfs_mount *mp, struct xfs_fsop_geom *geo,
+			   int nversion);
+extern int xfs_growfs_data(struct xfs_mount *mp, struct xfs_growfs_data *in);
+extern int xfs_growfs_log(struct xfs_mount *mp, struct xfs_growfs_log *in);
+extern int xfs_fs_counts(struct xfs_mount *mp, struct xfs_fsop_counts *cnt);
+extern int xfs_reserve_blocks(struct xfs_mount *mp, uint64_t *inval,
+				struct xfs_fsop_resblks *outval);
+extern int xfs_fs_goingdown(struct xfs_mount *mp, uint32_t inflags);
 
 extern int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
 extern int xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 5a84f58816d3..67a8f084146f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -801,45 +801,31 @@ xfs_ioc_bulkstat(
 	return 0;
 }
 
-STATIC int
-xfs_ioc_fsgeometry_v1(
-	xfs_mount_t		*mp,
-	void			__user *arg)
-{
-	xfs_fsop_geom_t         fsgeo;
-	int			error;
-
-	error = xfs_fs_geometry(mp, &fsgeo, 3);
-	if (error)
-		return error;
-
-	/*
-	 * Caller should have passed an argument of type
-	 * xfs_fsop_geom_v1_t.  This is a proper subset of the
-	 * xfs_fsop_geom_t that xfs_fs_geometry() fills in.
-	 */
-	if (copy_to_user(arg, &fsgeo, sizeof(xfs_fsop_geom_v1_t)))
-		return -EFAULT;
-	return 0;
-}
-
 STATIC int
 xfs_ioc_fsgeometry(
 	xfs_mount_t		*mp,
-	void			__user *arg)
+	void			__user *arg,
+	int			version)
 {
-	xfs_fsop_geom_t		fsgeo;
+	struct xfs_fsop_geom	fsgeo;
+	size_t			len;
 	int			error;
 
-	error = xfs_fs_geometry(mp, &fsgeo, 4);
+	error = xfs_fs_geometry(mp, &fsgeo, version);
 	if (error)
 		return error;
 
-	if (copy_to_user(arg, &fsgeo, sizeof(fsgeo)))
+	if (version <= 3)
+		len = sizeof(struct xfs_fsop_geom_v1);
+	else if (version == 4)
+		len = sizeof(struct xfs_fsop_geom_v4);
+	else
+		len = sizeof(fsgeo);
+
+	if (copy_to_user(arg, &fsgeo, len))
 		return -EFAULT;
 	return 0;
 }
-
 /*
  * Linux extended inode flags interface.
  */
@@ -1870,10 +1856,11 @@ xfs_file_ioctl(
 		return xfs_ioc_bulkstat(mp, cmd, arg);
 
 	case XFS_IOC_FSGEOMETRY_V1:
-		return xfs_ioc_fsgeometry_v1(mp, arg);
-
+		return xfs_ioc_fsgeometry(mp, arg, 3);
+	case XFS_IOC_FSGEOMETRY_V4:
+		return xfs_ioc_fsgeometry(mp, arg, 4);
 	case XFS_IOC_FSGEOMETRY:
-		return xfs_ioc_fsgeometry(mp, arg);
+		return xfs_ioc_fsgeometry(mp, arg, 5);
 
 	case XFS_IOC_GETVERSION:
 		return put_user(inode->i_generation, (int __user *)arg);
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index 35c79e246fde..5033fdd2c9ab 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -63,7 +63,7 @@ xfs_compat_ioc_fsgeometry_v1(
 	struct xfs_mount	  *mp,
 	compat_xfs_fsop_geom_v1_t __user *arg32)
 {
-	xfs_fsop_geom_t		  fsgeo;
+	struct xfs_fsop_geom	  fsgeo;
 	int			  error;
 
 	error = xfs_fs_geometry(mp, &fsgeo, 3);
@@ -540,6 +540,7 @@ xfs_file_compat_ioctl(
 	switch (cmd) {
 	/* No size or alignment issues on any arch */
 	case XFS_IOC_DIOINFO:
+	case XFS_IOC_FSGEOMETRY_V4:
 	case XFS_IOC_FSGEOMETRY:
 	case XFS_IOC_FSGETXATTR:
 	case XFS_IOC_FSSETXATTR:
-- 
2.15.0.rc0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 12/14] xfs: convert remaining xfs_sb_version_... checks to bool
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (10 preceding siblings ...)
  2017-10-26  8:33 ` [PATCH 11/14] xfs: bump XFS_IOC_FSGEOMETRY to v5 structures Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26 16:03   ` Darrick J. Wong
  2017-10-26  8:33 ` [PATCH 13/14] xfs: add support for "thin space" filesystems Dave Chinner
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Some were missed in the pass that converted the function return
values from int to bool. Update the remaining ones for consistency.

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index d4d9bef20c3a..3fb6d2a96d36 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -505,12 +505,12 @@ xfs_sb_has_incompat_log_feature(
 /*
  * V5 superblock specific feature checks
  */
-static inline int xfs_sb_version_hascrc(struct xfs_sb *sbp)
+static inline bool xfs_sb_version_hascrc(struct xfs_sb *sbp)
 {
 	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5;
 }
 
-static inline int xfs_sb_version_has_pquotino(struct xfs_sb *sbp)
+static inline bool xfs_sb_version_has_pquotino(struct xfs_sb *sbp)
 {
 	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5;
 }
-- 
2.15.0.rc0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 13/14] xfs: add support for "thin space" filesystems
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (11 preceding siblings ...)
  2017-10-26  8:33 ` [PATCH 12/14] xfs: convert remaining xfs_sb_version_... checks to bool Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26  8:33 ` [PATCH 14/14] xfs: add growfs support for changing usable blocks Dave Chinner
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

With the separation of the device size from the usable space, and the
metadata reservation space hidden from the user, we can now add
direct manipulation of the usable space. This involves modifying the
on-disk superblock to store the maximum usable space and allowing
growfs to change the usable space rather than the device size.

This patch adds the feature bit and superblock support for
storing the maximum usable space; growfs support will be added separately.
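
As an illustrative sketch, userspace could detect the feature bit and report
the usable space limit from the V5 geometry output along the following lines;
the helper and include path are assumptions, the flag and fields come from
this series:

/* Sketch: report the thinspace limit via the V5 geometry ioctl. */
#include <stdio.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>		/* assumed to carry the updated xfs_fs.h */

static void report_thinspace(int fd)
{
	struct xfs_fsop_geom	geo;

	if (ioctl(fd, XFS_IOC_FSGEOMETRY, &geo) < 0) {
		perror("XFS_IOC_FSGEOMETRY");
		return;
	}

	if (geo.flags & XFS_FSOP_GEOM_FLAGS_THINSPACE)
		printf("thinspace: %llu of %llu fsblocks usable\n",
		       (unsigned long long)geo.usable_dblocks,
		       (unsigned long long)geo.datablocks);
	else
		printf("not a thinspace filesystem\n");
}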

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_format.h | 22 ++++++++++++----
 fs/xfs/libxfs/xfs_fs.h     |  4 ++-
 fs/xfs/libxfs/xfs_sb.c     | 62 ++++++++++++++++++++++++++++++++++++----------
 fs/xfs/xfs_fsops.c         |  7 +++++-
 fs/xfs/xfs_ondisk.h        |  2 +-
 5 files changed, 76 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 3fb6d2a96d36..38972cd7b9e2 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -105,7 +105,7 @@ struct xfs_ifork;
 typedef struct xfs_sb {
 	uint32_t	sb_magicnum;	/* magic number == XFS_SB_MAGIC */
 	uint32_t	sb_blocksize;	/* logical block size, bytes */
-	xfs_rfsblock_t	sb_dblocks;	/* number of data blocks */
+	xfs_rfsblock_t	sb_dblocks;	/* number of data blocks in device LBA */
 	xfs_rfsblock_t	sb_rblocks;	/* number of realtime blocks */
 	xfs_rtblock_t	sb_rextents;	/* number of realtime extents */
 	uuid_t		sb_uuid;	/* user-visible file system unique id */
@@ -184,6 +184,8 @@ typedef struct xfs_sb {
 	xfs_lsn_t	sb_lsn;		/* last write sequence */
 	uuid_t		sb_meta_uuid;	/* metadata file system unique id */
 
+	uint64_t	sb_usable_dblocks; /* usable space limit */
+
 	/* must be padded to 64 bit alignment */
 } xfs_sb_t;
 
@@ -196,7 +198,7 @@ typedef struct xfs_sb {
 typedef struct xfs_dsb {
 	__be32		sb_magicnum;	/* magic number == XFS_SB_MAGIC */
 	__be32		sb_blocksize;	/* logical block size, bytes */
-	__be64		sb_dblocks;	/* number of data blocks */
+	__be64		sb_dblocks;	/* number of data blocks in device LBA */
 	__be64		sb_rblocks;	/* number of realtime blocks */
 	__be64		sb_rextents;	/* number of realtime extents */
 	uuid_t		sb_uuid;	/* user-visible file system unique id */
@@ -271,6 +273,8 @@ typedef struct xfs_dsb {
 	__be64		sb_lsn;		/* last write sequence */
 	uuid_t		sb_meta_uuid;	/* metadata file system unique id */
 
+	__be64		sb_usable_dblocks; /* usable space limit */
+
 	/* must be padded to 64 bit alignment */
 } xfs_dsb_t;
 
@@ -478,10 +482,12 @@ xfs_sb_has_ro_compat_feature(
 #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
 #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
 #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
+#define XFS_SB_FEAT_INCOMPAT_THINSPACE	(1 << 3)	/* usable space limited */
 #define XFS_SB_FEAT_INCOMPAT_ALL \
-		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
-		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
-		 XFS_SB_FEAT_INCOMPAT_META_UUID)
+		(XFS_SB_FEAT_INCOMPAT_FTYPE	| \
+		 XFS_SB_FEAT_INCOMPAT_SPINODES	| \
+		 XFS_SB_FEAT_INCOMPAT_META_UUID	| \
+		 XFS_SB_FEAT_INCOMPAT_THINSPACE)
 
 #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
 static inline bool
@@ -559,6 +565,12 @@ static inline bool xfs_sb_version_hasreflink(struct xfs_sb *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK);
 }
 
+static inline bool xfs_sb_version_hasthinspace(struct xfs_sb *sbp)
+{
+	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+		(sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_THINSPACE);
+}
+
 /*
  * end of superblock version macros
  */
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 223ec2695678..9fad678cae48 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -211,7 +211,8 @@ struct xfs_fsop_geom {
 	__u32		rtsectsize;	/* realtime sector size, bytes	*/
 	__u32		dirblocksize;	/* directory block size, bytes	*/
 	__u32		logsunit;	/* log stripe unit, bytes	*/
-	__u64		pad[16];	/* expansion space		*/
+	__u64		usable_dblocks;	/* usable space limit, fsblocks	*/
+	__u64		pad[15];	/* expansion space		*/
 };
 
 /* Output for XFS_FS_COUNTS */
@@ -252,6 +253,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_SPINODES	(1 << 18) /* sparse inode chunks   */
 #define XFS_FSOP_GEOM_FLAGS_RMAPBT	(1 << 19) /* reverse mapping btree */
 #define XFS_FSOP_GEOM_FLAGS_REFLINK	(1 << 20) /* files can share blocks */
+#define XFS_FSOP_GEOM_FLAGS_THINSPACE	(1 << 21) /* space limited fs	   */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 87b57abeace2..9a4593970e0a 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -268,6 +268,14 @@ xfs_mount_validate_sb(
 		return -EFSCORRUPTED;
 	}
 
+	if (xfs_sb_version_hasthinspace(sbp) &&
+	    (sbp->sb_usable_dblocks > sbp->sb_dblocks ||
+	     sbp->sb_usable_dblocks < sbp->sb_fdblocks ||
+	     sbp->sb_usable_dblocks < XFS_MIN_DBLOCKS(sbp))) {
+		xfs_notice(mp, "Thinspace SB sanity check failed");
+		return -EFSCORRUPTED;
+	}
+
 	/*
 	 * Until this is fixed only page-sized or smaller data blocks work.
 	 */
@@ -417,11 +425,13 @@ __xfs_sb_from_disk(
 	to->sb_features_incompat = be32_to_cpu(from->sb_features_incompat);
 	to->sb_features_log_incompat =
 				be32_to_cpu(from->sb_features_log_incompat);
+
 	/* crc is only used on disk, not in memory; just init to 0 here. */
 	to->sb_crc = 0;
 	to->sb_spino_align = be32_to_cpu(from->sb_spino_align);
 	to->sb_pquotino = be64_to_cpu(from->sb_pquotino);
 	to->sb_lsn = be64_to_cpu(from->sb_lsn);
+
 	/*
 	 * sb_meta_uuid is only on disk if it differs from sb_uuid and the
 	 * feature flag is set; if not set we keep it only in memory.
@@ -430,9 +440,20 @@ __xfs_sb_from_disk(
 		uuid_copy(&to->sb_meta_uuid, &from->sb_meta_uuid);
 	else
 		uuid_copy(&to->sb_meta_uuid, &from->sb_uuid);
+
 	/* Convert on-disk flags to in-memory flags? */
 	if (convert_xquota)
 		xfs_sb_quota_from_disk(to);
+
+	/*
+	 * Everything in memory relies on sb_usable_dblocks having a correct
+	 * value. For non-thin filesystems, this is not set on disk but is
+	 * simply the size of the device, so use it instead.
+	 */
+	if (xfs_sb_version_hasthinspace(to))
+		to->sb_usable_dblocks = be64_to_cpu(from->sb_usable_dblocks);
+	else
+		to->sb_usable_dblocks = to->sb_dblocks;
 }
 
 void
@@ -561,19 +582,24 @@ xfs_sb_to_disk(
 	to->sb_features2 = cpu_to_be32(from->sb_features2);
 	to->sb_bad_features2 = cpu_to_be32(from->sb_bad_features2);
 
-	if (xfs_sb_version_hascrc(from)) {
-		to->sb_features_compat = cpu_to_be32(from->sb_features_compat);
-		to->sb_features_ro_compat =
-				cpu_to_be32(from->sb_features_ro_compat);
-		to->sb_features_incompat =
-				cpu_to_be32(from->sb_features_incompat);
-		to->sb_features_log_incompat =
+	if (!xfs_sb_version_hascrc(from))
+		return;
+
+	/*
+	 * V5+ fields only after this point.
+	 */
+	to->sb_features_compat = cpu_to_be32(from->sb_features_compat);
+	to->sb_features_ro_compat = cpu_to_be32(from->sb_features_ro_compat);
+	to->sb_features_incompat = cpu_to_be32(from->sb_features_incompat);
+	to->sb_features_log_incompat =
 				cpu_to_be32(from->sb_features_log_incompat);
-		to->sb_spino_align = cpu_to_be32(from->sb_spino_align);
-		to->sb_lsn = cpu_to_be64(from->sb_lsn);
-		if (xfs_sb_version_hasmetauuid(from))
-			uuid_copy(&to->sb_meta_uuid, &from->sb_meta_uuid);
-	}
+	to->sb_spino_align = cpu_to_be32(from->sb_spino_align);
+	to->sb_lsn = cpu_to_be64(from->sb_lsn);
+
+	if (xfs_sb_version_hasmetauuid(from))
+		uuid_copy(&to->sb_meta_uuid, &from->sb_meta_uuid);
+	if (xfs_sb_version_hasthinspace(from))
+		to->sb_usable_dblocks = cpu_to_be64(from->sb_usable_dblocks);
 }
 
 static int
@@ -732,7 +758,7 @@ xfs_sb_mount_common(
 	 * Set up the filesystem size and addressing limits
 	 */
 	mp->m_LBA_size = sbp->sb_dblocks;
-	mp->m_usable_blocks = sbp->sb_dblocks;
+	mp->m_usable_blocks = sbp->sb_usable_dblocks;
 
 	mp->m_alloc_mxr[0] = xfs_allocbt_maxrecs(mp, sbp->sb_blocksize, 1);
 	mp->m_alloc_mxr[1] = xfs_allocbt_maxrecs(mp, sbp->sb_blocksize, 0);
@@ -824,6 +850,16 @@ xfs_initialize_perag_data(
 	sbp->sb_ifree = ifree;
 	sbp->sb_icount = ialloc;
 	sbp->sb_fdblocks = bfree + bfreelst + btree;
+
+	/*
+	 * The aggregate free space from the AGs does not take into account the
+	 * difference between the address space size and the maximum usable
+	 * space we have configured for thinspace filesystems.  Take that into
+	 * account now.
+	 */
+	if (xfs_sb_version_hasthinspace(sbp))
+		sbp->sb_fdblocks -= sbp->sb_dblocks - sbp->sb_usable_dblocks;
+
 	spin_unlock(&mp->m_sb_lock);
 
 	xfs_reinit_percpu_counters(mp);
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index b1659e535f7a..e0565eb01c0b 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -124,8 +124,13 @@ xfs_fs_geometry(
 		geo->logsunit = mp->m_sb.sb_logsunit;
 	}
 
-	if (new_version >= 5)
+	if (new_version >= 5) {
 		geo->version = XFS_FSOP_GEOM_VERSION_V5;
+		geo->flags |=
+			(xfs_sb_version_hasthinspace(&mp->m_sb) ?
+				XFS_FSOP_GEOM_FLAGS_THINSPACE : 0);
+		geo->usable_dblocks = mp->m_sb.sb_usable_dblocks;
+	}
 
 	return 0;
 }
diff --git a/fs/xfs/xfs_ondisk.h b/fs/xfs/xfs_ondisk.h
index 0492436a053f..5ae5dac38e6f 100644
--- a/fs/xfs/xfs_ondisk.h
+++ b/fs/xfs/xfs_ondisk.h
@@ -45,7 +45,7 @@ xfs_check_ondisk_structs(void)
 	XFS_CHECK_STRUCT_SIZE(struct xfs_dinode,		176);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_disk_dquot,		104);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_dqblk,			136);
-	XFS_CHECK_STRUCT_SIZE(struct xfs_dsb,			264);
+	XFS_CHECK_STRUCT_SIZE(struct xfs_dsb,			272);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_dsymlink_hdr,		56);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_key,		4);
 	XFS_CHECK_STRUCT_SIZE(struct xfs_inobt_rec,		16);
-- 
2.15.0.rc0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 14/14] xfs: add growfs support for changing usable blocks
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (12 preceding siblings ...)
  2017-10-26  8:33 ` [PATCH 13/14] xfs: add support for "thin space" filesystems Dave Chinner
@ 2017-10-26  8:33 ` Dave Chinner
  2017-10-26 11:30   ` Amir Goldstein
  2017-10-26 11:09 ` [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Amir Goldstein
  2017-10-30 13:31 ` Brian Foster
  15 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2017-10-26  8:33 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we have persistent usable block counts, we need to be able
to change them. This allows us to control thin provisioned
filesystem space usage at the filesystem level, not the block device
level.

If the grow operation grows the usable space beyond the current
LBA size of the filesystem, then we also need to physically grow the
filesystem to match the new size of the underlying device. Hence
grow behaves like it always has, except for the fact that it won't
grow physically until usable space would exceed the LBA size.

Being able to modify usable space also allows us to shrink the
filesystem on thin devices as easily as growing it. We simply reduce
the usable space and the free space, and we're done. The user then
needs to run an fstrim pass to ensure all the unused space in the
filesystem LBA is marked as unused by the underlying device. No data
or metadata movement is required as the underlying LBA space has not
changed.
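
As a hedged example of the workflow described above, a management tool might
drive a thinspace shrink and the follow-up fstrim pass roughly as below; the
helper and its policy are assumptions, only the growfs and FITRIM ioctls and
their structures are existing interfaces:

/* Sketch: shrink usable space on a thinspace fs, then trim the freed space. */
#include <linux/fs.h>		/* FITRIM, struct fstrim_range */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>		/* XFS_IOC_FSGROWFSDATA, struct xfs_growfs_data */

static int thin_shrink(int fd, uint64_t new_usable_fsblocks, uint32_t imaxpct)
{
	struct xfs_growfs_data	in = {
		.newblocks	= new_usable_fsblocks,
		.imaxpct	= imaxpct,	/* pass current value, unchanged */
	};
	struct fstrim_range	range;

	/* On a thinspace fs this only adjusts the usable space accounting. */
	if (ioctl(fd, XFS_IOC_FSGROWFSDATA, &in) < 0)
		return -1;

	/* Return the now-unused LBA space to the underlying thin pool. */
	memset(&range, 0, sizeof(range));
	range.len = UINT64_MAX;
	return ioctl(fd, FITRIM, &range);
}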

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_shared.h |   1 +
 fs/xfs/xfs_fsops.c         | 106 +++++++++++++++++++++++++++++++++++++--------
 fs/xfs/xfs_trans.c         |  21 +++++++++
 fs/xfs/xfs_trans.h         |   1 +
 4 files changed, 111 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 536fb353de03..41a34fb96047 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -106,6 +106,7 @@ int	xfs_log_calc_minimum_size(struct xfs_mount *);
 #define	XFS_TRANS_SB_RBLOCKS		0x00000800
 #define	XFS_TRANS_SB_REXTENTS		0x00001000
 #define	XFS_TRANS_SB_REXTSLOG		0x00002000
+#define	XFS_TRANS_SB_USABLE_DBLOCKS	0x00004000
 
 /*
  * Here we centralize the specification of XFS meta-data buffer reference count
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index e0565eb01c0b..d0a6e723e924 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -504,7 +504,7 @@ xfs_grow_ag_headers(
 }
 
 static int
-xfs_growfs_data_private(
+xfs_growfs_data_physical(
 	xfs_mount_t		*mp,		/* mount point for filesystem */
 	xfs_growfs_data_t	*in)		/* growfs data input struct */
 {
@@ -520,11 +520,11 @@ xfs_growfs_data_private(
 	xfs_trans_t		*tp;
 	LIST_HEAD		(buffer_list);
 	struct aghdr_init_data	id = {};
+	struct xfs_owner_info	oinfo;
 
 	nb = in->newblocks;
-	if (nb < mp->m_LBA_size)
-		return -EINVAL;
-	if ((error = xfs_sb_validate_fsb_count(&mp->m_sb, nb)))
+	error = xfs_sb_validate_fsb_count(&mp->m_sb, nb);
+	if (error)
 		return error;
 	error = xfs_buf_read_uncached(mp->m_ddev_targp,
 				XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1),
@@ -539,7 +539,7 @@ xfs_growfs_data_private(
 	if (nb_mod && nb_mod < XFS_MIN_AG_BLOCKS) {
 		nagcount--;
 		nb = (xfs_rfsblock_t)nagcount * mp->m_sb.sb_agblocks;
-		if (nb < mp->m_LBA_size)
+		if (nb <= mp->m_LBA_size)
 			return -EINVAL;
 	}
 	new = nb - mp->m_LBA_size;
@@ -596,7 +596,6 @@ xfs_growfs_data_private(
 	 * There are new blocks in the old last a.g.
 	 */
 	if (new) {
-		struct xfs_owner_info	oinfo;
 
 		/*
 		 * Change the agi length.
@@ -649,9 +648,12 @@ xfs_growfs_data_private(
 	 */
 	if (nagcount > oagcount)
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
-	if (nb > mp->m_LBA_size)
-		xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS,
-				 nb - mp->m_LBA_size);
+	if (nb > mp->m_LBA_size) {
+		int64_t delta = nb - mp->m_sb.sb_dblocks;
+
+		xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta);
+		xfs_trans_mod_sb(tp, XFS_TRANS_SB_USABLE_DBLOCKS, delta);
+	}
 	if (id.nfree)
 		xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
 	xfs_trans_set_sync(tp);
@@ -660,13 +662,12 @@ xfs_growfs_data_private(
 		return error;
 
 	/* New allocation groups fully initialized, so update mount struct */
-	mp->m_usable_blocks = mp->m_sb.sb_dblocks;
-	mp->m_LBA_size = mp->m_sb.sb_dblocks;
-
 	if (nagimax)
 		mp->m_maxagi = nagimax;
-	xfs_set_low_space_thresholds(mp);
-	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+
+	/* This is a physical grow so the usable size matches the device size */
+	mp->m_LBA_size = mp->m_sb.sb_dblocks;
+	mp->m_usable_blocks = mp->m_LBA_size;
 
 	/*
 	 * If we expanded the last AG, free the per-AG reservation
@@ -801,6 +802,61 @@ xfs_growfs_update_superblocks(
 	return saved_error ? saved_error : error;
 }
 
+/*
+ * For thin filesystems, first we adjust the logical size of the filesystem
+ * to match the desired change. If the filesystem is physically not large
+ * enough, then we grow to the maximum logical size and leave the rest to
+ * the physical grow step. We also leave the secondary superblock update
+ * to the physical grow step.
+ */
+static int
+xfs_growfs_data_thinspace(
+	struct xfs_mount	*mp,
+	struct xfs_growfs_data	*in)
+{
+	struct xfs_sb		*sbp = &mp->m_sb;
+	struct xfs_trans	*tp;
+	int64_t			delta;
+	int			error;
+
+	if (!xfs_sb_version_hasthinspace(sbp))
+		return 0;
+
+	delta = in->newblocks - sbp->sb_usable_dblocks;
+	if (!delta)
+		return 0;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata,
+			XFS_GROWFS_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp);
+	if (error)
+		return error;
+
+	/* grow to maximum logical size */
+	if (delta > 0) {
+		delta = min_t(int64_t, delta,
+			     sbp->sb_dblocks - sbp->sb_usable_dblocks);
+	}
+
+	/*
+	 * Modify incore free block counter. Shrink will return ENOSPC here if
+	 * there isn't free space available to shrink the amount requested.
+	 * We need this ENOSPC check here, which is why we can't use
+	 * xfs_trans_mod_sb() for this set of superblock modifications.
+	 */
+	error = xfs_mod_fdblocks(mp, delta, false);
+	if (error) {
+		xfs_trans_cancel(tp);
+		return error;
+	}
+
+	/* Update the new size and log the superblock. */
+	sbp->sb_usable_dblocks += delta;
+	mp->m_usable_blocks += delta;
+	xfs_log_sb(tp);
+	xfs_trans_set_sync(tp);
+	return xfs_trans_commit(tp);
+}
+
 /*
  * protected versions of growfs function acquire and release locks on the mount
  * point - exported through ioctls: XFS_IOC_FSGROWFSDATA, XFS_IOC_FSGROWFSLOG,
@@ -822,12 +878,24 @@ xfs_growfs_data(
 	if (in->imaxpct != mp->m_sb.sb_imax_pct) {
 		error = xfs_growfs_imaxpct(mp, in->imaxpct);
 		if (error)
-			goto out_error;
+			goto out_unlock;
 	}
 
-	error = xfs_growfs_data_private(mp, in);
+	error = xfs_growfs_data_thinspace(mp, in);
 	if (error)
-		goto out_error;
+		goto out_unlock;
+
+	/*
+	 * For thinspace filesystems, we can shrink the logical size and hence
+	 * newblocks can be less than the sb_dblocks. Shrinks will be done
+	 * entirely in thinspace, so only do a physical grow if it is needed.
+	 */
+	if (!xfs_sb_version_hasthinspace(&mp->m_sb) ||
+	    in->newblocks > mp->m_LBA_size) {
+		error = xfs_growfs_data_physical(mp, in);
+		if (error)
+			goto out_unlock;
+	}
 
 	/*
 	 * Post growfs calculations needed to reflect new state in operations
@@ -838,13 +906,15 @@ xfs_growfs_data(
 		mp->m_maxicount = icount << mp->m_sb.sb_inopblog;
 	} else
 		mp->m_maxicount = 0;
+	xfs_set_low_space_thresholds(mp);
+	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
 
 	/*
 	 * Update secondary superblocks now the physical grow has completed
 	 */
 	error = xfs_growfs_update_superblocks(mp);
 
-out_error:
+out_unlock:
 	/*
 	 * Increment the generation unconditionally, the error could be from
 	 * updating the secondary superblocks, in which case the new size
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index a87f657f59c9..eb1658deacd6 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -391,6 +391,9 @@ xfs_trans_mod_sb(
 	case XFS_TRANS_SB_REXTSLOG:
 		tp->t_rextslog_delta += delta;
 		break;
+	case XFS_TRANS_SB_USABLE_DBLOCKS:
+		tp->t_usable_dblock_delta += delta;
+		break;
 	default:
 		ASSERT(0);
 		return;
@@ -477,6 +480,15 @@ xfs_trans_apply_sb_deltas(
 		whole = 1;
 	}
 
+	/* Only modify the thinspace sb fields if enabled */
+	if (xfs_sb_version_hasthinspace(&tp->t_mountp->m_sb) &&
+	    tp->t_usable_dblock_delta) {
+		be64_add_cpu(&sbp->sb_usable_dblocks,
+			     tp->t_usable_dblock_delta);
+		whole = 1;
+	}
+
+
 	xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SB_BUF);
 	if (whole)
 		/*
@@ -659,9 +671,18 @@ xfs_trans_unreserve_and_mod_sb(
 		if (error)
 			goto out_undo_rextents;
 	}
+	if (tp->t_usable_dblock_delta != 0) {
+		error = xfs_sb_mod64(&mp->m_sb.sb_usable_dblocks,
+				     tp->t_usable_dblock_delta);
+		if (error)
+			goto out_undo_rextslog;
+	}
 	spin_unlock(&mp->m_sb_lock);
 	return;
 
+out_undo_rextslog:
+	if (tp->t_rextslog_delta)
+		xfs_sb_mod8(&mp->m_sb.sb_rextslog, -tp->t_rextslog_delta);
 out_undo_rextents:
 	if (tp->t_rextents_delta)
 		xfs_sb_mod64(&mp->m_sb.sb_rextents, -tp->t_rextents_delta);
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 815b53d20e26..f8c816956ba2 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -131,6 +131,7 @@ typedef struct xfs_trans {
 	int64_t			t_rblocks_delta;/* superblock rblocks change */
 	int64_t			t_rextents_delta;/* superblocks rextents chg */
 	int64_t			t_rextslog_delta;/* superblocks rextslog chg */
+	int64_t			t_usable_dblock_delta;/* usable space */
 	struct list_head	t_items;	/* log item descriptors */
 	struct list_head	t_busy;		/* list of busy extents */
 	unsigned long		t_pflags;	/* saved process flags state */
-- 
2.15.0.rc0


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (13 preceding siblings ...)
  2017-10-26  8:33 ` [PATCH 14/14] xfs: add growfs support for changing usable blocks Dave Chinner
@ 2017-10-26 11:09 ` Amir Goldstein
  2017-10-26 12:35   ` Dave Chinner
  2017-10-30 13:31 ` Brian Foster
  15 siblings, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2017-10-26 11:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel

On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
> This patchset is aimed at filesystems that are installed on sparse
> block devices, a.k.a thin provisioned devices. The aim of the
> patchset is to bring the space management aspect of the storage
> stack up into the filesystem rather than keeping it below the
> filesystem where users and the filesystem have no clue they are
> about to run out of space.
>
> The idea is that thin block devices will be massively
> over-provisioned giving the filesystem a large block device address
> space to manage, but the filesystem presents itself as a much
> smaller filesystem. That is, the space the filesystem presents the
> users is much lower than the what the address space teh block device
> provides.
>
> This somewhat turns traditional thin provisioning on it's head.
> Admins are used to lying through their teeth to users about how much
> space they have available, and then they hope to hell that users
> never try to store as much data as they've been "provisioned" with.
> As a result, the traditional failure case is the block device
> running out of space all of a sudden and the filesystem and
> users wondering WTF just went wrong with their system.
>
> Moving the space management up into the filesystem by itself doesn't
> solve this problem - the thin storage pools can still be
> over-committed - but it does allow a new way of managing the space.
> Essentially, growing or shrinking a thin filesystem is an
> operation that only takes a couple of milliseconds to do because
> it's just an accounting trick. It's far less complex than creating
> a new file, or even reading data from a file.
>
> Freeing unused space from the filesystem isn't done during a shrink
> operation. It is done through discard operations, either dynamically
> via the discard mount option or, preferrably, by an fstrim
> invocation. This means freeing space in the thin pool is not in any
> way related to the management of the filesystem size and space
> enforcement even during a grow or shrink operation.
>
> What it means is that the filesystem controls the amount of active
> data the user can have in the thin pool. The thin pool usage may be
> more or less, depending on snapshots, deduplication,
> freed-but-not-discarded space, etc. And because of how low the
> overhead of changing the accounting is, users don't need to be given
> a filesystem with all the space they might need once in a blue moon.
> It is trivial to expand when need, and shrink and release when the
> data is removed.
>
> Yes, the underlying thin device that the filesystem sits on gets
> provisioned at the "once in a blue moon" size that is requested,
> but until that space is needed the filesystem can run at low amounts
> of reported free space and so prevent the likelyhood of sudden
> thin device pool depletion.
>
> Normally, running a filesysetm for low periods of time at low
> amounts of free space is a bad thing. However, for a thin
> filesystem, a low amount of usable free space doesn't mean the
> filesystem is running near full. The filesystem still has the full
> block device address to work with, so has oodles of contiguous free
> space hidden from the user. hence it's not until the thin filesystem
> grows to be near "non-thin" and is near full that the traditional
> "running near ENOSPC" problems arise.
>
> How to stop that from ever happening? e.g. Some one needs 100GB of
> space now, but maybe much more than that in a year. So provision a
> 10TB thin block device and put a 100GB thin filesystem on it.
> Problems won't arise until it's been grown to 100x it's original
> size.
>
> Yeah, it all requires thinking about the way storage is provisioned
> and managed a little bit differently, but the key point to realise
> is that grow and shrink effectively become free operations on
> thin devices if the filesystem is aware that it's on a thin device.
>
> The patchset has several parts to it. It is built on a 4.14-rc5
> kernel with for-next and Darrick's scrub tree from a couple of days
> ago merged into it.
>
> The first part of teh series is a growfs refactoring. This can
> probably stand alone, and the idea is to move the refactored
> infrastructure into libxfs so it can be shared with mkfs. This also
> cleans up a lot of the cruft in growfs and so makes it much easier
> to add the changes later in the series.
>
> The second part of the patchset moves the functionality of
> sb_dblocks into the struct xfs_mount. This provides the separation
> of address space checks and capacty related calculations that the
> thinspace mods require. This also fixes the problem of freshly made,
> empty filesystems reporting 2% of the space as used.
>
> The XFS_IOC_FSGEOMETRY ioctl needed to be bumped to a new version
> because the structure needed growing.
>
> Finally, there's the patches that provide thinspace support and the
> growfs mods needed to grow and shrink.
>
> I've smoke tested the non-thinspace code paths (running auto tests
> on a scrub enabled kernel+userspace right now) as I haven't updated
> the userspace code to exercise the thinp code paths yet. I know the
> concept works, but my userspace code has an older on-disk format
> from the prototype so it will take me a couple of days to update and
> work out how to get fstests to integrate it reliably. So this is
> mainly a heads-up RFC patchset....
>
> Comments, thoughts, flames all welcome....
>

This proposal is very interesting outside the scope of xfs, so I hope you
don't mind I've CC'ed fsdevel.

I am thinking how a slightly similar approach could be used to online shrink
the physical size for filesystems that are not on thin provisioned devices:

- Set/get a geometry variable of "agsoftlimit" (better names are welcome)
  which is <= agcount.
- agsoftlimit < agcount means that free space of AG > agsoftlimit is zero,
  so total disk space usage will not show this space as available user space.
- inode and block allocators will avoid dipping into the high AG pool,
  except for metadata blocks needed for freeing high AG inodes/blocks
  (see the sketch below).
- A variant of xfs_fsr (or e4defrag for that matter) could "migrate" inodes
  and/or blocks from high to low AGs.
- Migrating directories is quite different than migrating files, but doable.
- Finally, on XFS_IOC_FSGROWFSDATA, if shrinking filesystem size and
  high AG usage counters are zero, then physical size can be shrunk
  down as far as agsoftlimit instead of reducing usable_blocks.

With this, xfs can gain physical shrink support and ext4 can gain online
(and safe) shrink support.
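
A minimal hypothetical sketch of the allocator check from the list above; the
helper and its arguments are invented for illustration and are not existing
XFS functions:

#include <stdbool.h>
#include <stdint.h>

/* Invented helper: would an AG be eligible under the proposed agsoftlimit? */
static bool ag_allowed_for_alloc(uint32_t agno, uint32_t agsoftlimit,
				 bool freeing_high_ag_metadata)
{
	/* AGs below the soft limit behave exactly as they do today. */
	if (agno < agsoftlimit)
		return true;
	/*
	 * High AGs are hidden from normal allocation, except for the
	 * metadata blocks needed to free high AG inodes/blocks.
	 */
	return freeing_high_ag_metadata;
}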

Assuming that this idea is not shot down on sight, the only implication
I can think of w.r.t your current patches is leaving enough room in new APIs
  to accommodate this prospective functionality.

You have already reserved 15 u64 in geometry V5 ioctl struct, so that's good.
You have not changed XFS_IOC_FSGROWFSDATA at all, so going forward
the ambiguity of physical shrink vs. virtual shrink could either be determined
by heuristics (shrink physical if usable == physical > agsoftlimit) or a new
ioctl would be introduced to disambiguate the intention.
I have a suggestion for 3rd option, but I'll post it on the relevant patch.


Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 14/14] xfs: add growfs support for changing usable blocks
  2017-10-26  8:33 ` [PATCH 14/14] xfs: add growfs support for changing usable blocks Dave Chinner
@ 2017-10-26 11:30   ` Amir Goldstein
  2017-10-26 12:48     ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2017-10-26 11:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Now that we have persistent usable block counts, we need to be able
> to change them. This allows us to control thin provisioned
> filesystem space usage at the filesystem level, not the block device
> level.
>
> If the grow operation grows the usable space beyond the current
> LBA size of the filesystem, then we also need to physically grow the
> filesystem to match the new size of the underlying device. Hence
> grow behaves like it always has, expect for the fact that it wont'
> grow physically until usable space would exceed the LBA size.
>
> Being able to modify usable space also allows us to shrink the
> filesystem on thin devices as easily as growing it. We simply reduce
> the usable space and the free space, and we're done. The user then
> needs to run a fstrim pass to ensure all the unused space in the
> filesystem LBA is marked as unused by the underlying device. No data
> or metadata movement is required as the underlying LBA space has not
> changed.
>
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>

With this change, the behavior of a userspace program that tries to shrink the
filesystem size will change from -EINVAL to success for filesystems that were
configured to allow that. But an unmodified userspace program may still be
caught by surprise by this success return code that was never exercised in the past.

I have also argued elsewhere that making the request to shrink the "virtual"
size vs. the physical size implicit rather than explicit would hinder future
attempts to use the API as it was intended, to implement physical shrink.

Suggestion:
Let userspace opt in to the new "virtual grow" API by using the 3 upper
bytes in (struct xfs_growfs_data){.imaxpct}.
Those bytes are guaranteed to be zeroed by old applications due to the value
range check in the current code, so there is plenty of room to add a flags byte
and use it to request growing USABLE_DBLOCK explicitly.
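
A hypothetical sketch of that encoding; the flag name and value below are
invented purely for illustration and nothing like them exists in the current
API:

#include <stdint.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* XFS_IOC_FSGROWFSDATA, struct xfs_growfs_data */

/* Invented opt-in flag living above the 0-100 imaxpct value range. */
#define XFS_GROWFS_USABLE_DBLOCKS	(1U << 8)

static int grow_usable_explicit(int fd, uint64_t new_usable_fsblocks,
				uint32_t imaxpct)
{
	struct xfs_growfs_data	in = {
		.newblocks	= new_usable_fsblocks,
		/* old kernels reject this via the imaxpct range check */
		.imaxpct	= imaxpct | XFS_GROWFS_USABLE_DBLOCKS,
	};

	return ioctl(fd, XFS_IOC_FSGROWFSDATA, &in);
}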

All the logic in your code stays the same (i.e. grow physically to accommodate
growing virtually), only we steer away from being called by old apps by
mistake.

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-10-26 11:09 ` [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Amir Goldstein
@ 2017-10-26 12:35   ` Dave Chinner
  2017-11-01 22:31     ` Darrick J. Wong
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2017-10-26 12:35 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-xfs, linux-fsdevel

On Thu, Oct 26, 2017 at 02:09:26PM +0300, Amir Goldstein wrote:
> On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
> > This patchset is aimed at filesystems that are installed on sparse
> > block devices, a.k.a thin provisioned devices. The aim of the
> > patchset is to bring the space management aspect of the storage
> > stack up into the filesystem rather than keeping it below the
> > filesystem where users and the filesystem have no clue they are
> > about to run out of space.
.....
> > I've smoke tested the non-thinspace code paths (running auto tests
> > on a scrub enabled kernel+userspace right now) as I haven't updated
> > the userspace code to exercise the thinp code paths yet. I know the
> > concept works, but my userspace code has an older on-disk format
> > from the prototype so it will take me a couple of days to update and
> > work out how to get fstests to integrate it reliably. So this is
> > mainly a heads-up RFC patchset....
> >
> > Comments, thoughts, flames all welcome....
> >
> 
> This proposal is very interesting outside the scope of xfs, so I hope you
> don't mind I've CC'ed fsdevel.
> 
> I am thinking how a slightly similar approach could be used to online shrink
> the physical size for filesystems that are not on thin provisioned devices:
> 
> - Set/get a geometry variable of "agsoftlimit" (better names are welcome)
>   which is <= agcount.
> - agsoftlimit < agcount means that free space of AG > agsoftlimit is zero,
>   so total disk space usage will not show this space as available user space.
> - inode and block allocators will avoid dipping into the high AG pool,
>   expect for metadata block needed for freeing high AG inodes/blocks.
> - A variant of xfs_fsr (or e4defrag for that matter) could "migrate" inodes
>   and/or blocks from high to low AGs.
> - Migrating directories is quite different than migrating files, but doable.
> - Finally, on XFS_IOC_FSGROWFSDATA, if shrinking filesystem size and
>   high AG usage counters are zero, then physical size can be shrunk
>   as down as agsoftlimit instead of reducing usable_blocks.

Yup, you've just described all the craziness that a physical shrink
requires on XFS. Lots of new user APIs, new tools to move data
around, new code to transparently migrate directories and other
metadata (like xattrs), etc.

Also, the log is placed half way through the XFS filesystem, so
unless we add code to allocate and switch to a new journal (in a
crash safe and recoverable way!) we can't shrink by more than 50%.

Also, none of the growfs code touches existing AGs - they'll have to
be scanned to determine they really are empty before they get
removed from the filesystem, and then there's the other issues like
we can't shrink to less than 2 AGs, which puts a significant minimum
shrink size on filesystems (again there's that "shrink more than 50%
requires a lot more work" problem for filesystems < 4TB).

And to do it efficiently, we really need rmap support in filesystems
so the fs can tell us what files and metadata need to be moved,
rather than having to do brute force scans to work out what needs
moving. Especially as the brute force scans can't find all the
metadata that we might need to relocate before we've emptied the
space we need to stop using.

IOWs, it's a *lot* of work, and IMO there's more work in
verification and proving that everything is crash safe, recoverable
and restartable. We've known how much work it is for years - why do
you think it hasn't been implemented? See:

http://xfs.org/index.php/Shrinking_Support

And:

http://xfs.org/index.php/Unfinished_work#The_xfs_reno_tool

And specifically follow the reference to a discussion in 2007:

https://marc.info/?l=linux-xfs&m=119131697224361&w=2

> With this, xfs can gain physical shrink support and ext4 can gain online
> (and safe) shrink support.

Yes, I estimate it'll probably take about a man-year's worth of work
to get xfs shrink to production ready from all the pieces we have
sitting around today.

> Assuming that this idea is not shot down on sight, the only implication
> I can think of w.r.t your current patches is leaving enough room in new APIs
> to accomodate this prospect functionality.

I'm not introducing any new APIs. XFS_IOC_FSGROWFSDATA already
supports shrinking and resizing/moving the log, they just aren't
implemented.

> You have already reserved 15 u64 in geometry V5 ioctl struct, so that's good.
> You have not changed XFS_IOC_FSGROWFSDATA at all, so going forward
> the ambiguity of physical shrink vs. virtual shrink could either be determined
> by heuristics

No heuristics at all. filesystems on thin devices will have a
feature bit in the superblock indicating they are thin filesystems.
If the "thinspace" bit is set, shrink is just an accounting
operation. If it's not set, then it needs to physically change the
geometry of the filesystem....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 14/14] xfs: add growfs support for changing usable blocks
  2017-10-26 11:30   ` Amir Goldstein
@ 2017-10-26 12:48     ` Dave Chinner
  2017-10-26 13:32       ` Amir Goldstein
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2017-10-26 12:48 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-xfs

On Thu, Oct 26, 2017 at 02:30:22PM +0300, Amir Goldstein wrote:
> On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > Now that we have persistent usable block counts, we need to be able
> > to change them. This allows us to control thin provisioned
> > filesystem space usage at the filesystem level, not the block device
> > level.
> >
> > If the grow operation grows the usable space beyond the current
> > LBA size of the filesystem, then we also need to physically grow the
> > filesystem to match the new size of the underlying device. Hence
> > grow behaves like it always has, expect for the fact that it wont'
> > grow physically until usable space would exceed the LBA size.
> >
> > Being able to modify usable space also allows us to shrink the
> > filesystem on thin devices as easily as growing it. We simply reduce
> > the usable space and the free space, and we're done. The user then
> > needs to run a fstrim pass to ensure all the unused space in the
> > filesystem LBA is marked as unused by the underlying device. No data
> > or metadata movement is required as the underlying LBA space has not
> > changed.
> >
> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> 
> With this change, behavior of userspace program that tried to shrink filesystem
> size will change from -EINVAL to success for filesystems that were configured
> to allow that. But unmodified userspace program may still be caught by surprise
> from this success return code that was never excersized in the past.

What userspace program would be trying to shrink XFS filesystems
that doesn't already handle grow operations from the same ioctl call
returning success? Hell, I'd like to know what app is even trying to
shrink XFS filesystems...

> I have also argued elsewhere that the fact that the request to shrink the
> "virtual" size vs. physical size is implicit and not explicit, that would hinder
> future attempts to use the API as it was intended to implement physical shrink.

No, feature bits decide the action to take without any ambiguity.

> Suggestion:
> Let userspace opt-in for the new "virtual grow" API by using the 3 upper
> bytes in (struct xfs_growfs_data){.imaxpct}.
> Those byes are guarantied to be zeroed by old application due to value
> range check in current code, so there is plenty of room to add flags byte
> and use it to request to grow USABLE_DBLOCK explicitly.

What's the point of adding this complexity? AFAICT it's a solution
for a problem that doesn't exist....

> All the logic in your code stays the same (i.e. grow physical to accomodate
> for growing virtual) only we stir away from being called by old apps by
> mistake.

My care factor about old 3rd party apps that have never been able to
test that shrink code path actually succeeded because the kernel
didn't support it is pretty damn close to zero.

Actually, wait ..... Ahhhhh. I have just reached the state of Care
Factor Zero. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 14/14] xfs: add growfs support for changing usable blocks
  2017-10-26 12:48     ` Dave Chinner
@ 2017-10-26 13:32       ` Amir Goldstein
  2017-10-27 10:26         ` Amir Goldstein
  0 siblings, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2017-10-26 13:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 26, 2017 at 3:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Oct 26, 2017 at 02:30:22PM +0300, Amir Goldstein wrote:
>> On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > From: Dave Chinner <dchinner@redhat.com>
>> >
>> > Now that we have persistent usable block counts, we need to be able
>> > to change them. This allows us to control thin provisioned
>> > filesystem space usage at the filesystem level, not the block device
>> > level.
>> >
>> > If the grow operation grows the usable space beyond the current
>> > LBA size of the filesystem, then we also need to physically grow the
>> > filesystem to match the new size of the underlying device. Hence
>> > grow behaves like it always has, expect for the fact that it wont'
>> > grow physically until usable space would exceed the LBA size.
>> >
>> > Being able to modify usable space also allows us to shrink the
>> > filesystem on thin devices as easily as growing it. We simply reduce
>> > the usable space and the free space, and we're done. The user then
>> > needs to run a fstrim pass to ensure all the unused space in the
>> > filesystem LBA is marked as unused by the underlying device. No data
>> > or metadata movement is required as the underlying LBA space has not
>> > changed.
>> >
>> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
>>
>> With this change, behavior of userspace program that tried to shrink filesystem
>> size will change from -EINVAL to success for filesystems that were configured
>> to allow that. But unmodified userspace program may still be caught by surprise
>> from this success return code that was never excersized in the past.
>
> What userspace program would be trying to shrink XFS filesystems
> that doesn't already handle grow operations from the same ioctl call
> returning success? Hell, I'd like to know what app is even trying to
> shrink XFS filesystems...
>

A buggy program of course ;-)

>> I have also argued elsewhere that the fact that the request to shrink the
>> "virtual" size vs. physical size is implicit and not explicit, that would hinder
>> future attempts to use the API as it was intended to implement physical shrink.
>
> No, feature bits decide the action to take without any ambiguity.
>

The ambiguity I was referring to was of the user program's intent.
Did it request to set the filesystem size to N or the filesystem usable size to N?
When growing, the difference in intent doesn't change the result.
When shrinking AND the feature bit is set, the intent makes a difference.

>> Suggestion:
>> Let userspace opt-in for the new "virtual grow" API by using the 3 upper
>> bytes in (struct xfs_growfs_data){.imaxpct}.
>> Those byes are guarantied to be zeroed by old application due to value
>> range check in current code, so there is plenty of room to add flags byte
>> and use it to request to grow USABLE_DBLOCK explicitly.
>
> What's the point of adding this complexity? AFAICT it's a solution
> for a problem that doesn't exist....
>

AFAICT you are right, but API review is a bit like legal contract review -
picking at problems that don't seem to exist - until one day we realize
that they do...

>> All the logic in your code stays the same (i.e. grow physical to accomodate
>> for growing virtual) only we stir away from being called by old apps by
>> mistake.
>
> My care factor about old 3rd party apps that have never been able to
> test that shrink code path actually succeeded because the kernel
> didn't support it is pretty damn close to zero.
>
> Actually, wait ..... Ahhhhh. I have just reached the state of Care
> Factor Zero. :)
>

Look to your right. I am right there with you :)
Anyway, I think that the cost of being wrong on this one is not so high
(famous last words)

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 12/14] xfs: convert remaining xfs_sb_version_... checks to bool
  2017-10-26  8:33 ` [PATCH 12/14] xfs: convert remaining xfs_sb_version_... checks to bool Dave Chinner
@ 2017-10-26 16:03   ` Darrick J. Wong
  0 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2017-10-26 16:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 26, 2017 at 07:33:20PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Some were missed in the pass that converted the function return
> values from int to bool. Update the remaining ones for consistency.
> 
> Signed-Off-By: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

> ---
>  fs/xfs/libxfs/xfs_format.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index d4d9bef20c3a..3fb6d2a96d36 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -505,12 +505,12 @@ xfs_sb_has_incompat_log_feature(
>  /*
>   * V5 superblock specific feature checks
>   */
> -static inline int xfs_sb_version_hascrc(struct xfs_sb *sbp)
> +static inline bool xfs_sb_version_hascrc(struct xfs_sb *sbp)
>  {
>  	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5;
>  }
>  
> -static inline int xfs_sb_version_has_pquotino(struct xfs_sb *sbp)
> +static inline bool xfs_sb_version_has_pquotino(struct xfs_sb *sbp)
>  {
>  	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5;
>  }
> -- 
> 2.15.0.rc0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 14/14] xfs: add growfs support for changing usable blocks
  2017-10-26 13:32       ` Amir Goldstein
@ 2017-10-27 10:26         ` Amir Goldstein
  0 siblings, 0 replies; 47+ messages in thread
From: Amir Goldstein @ 2017-10-27 10:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 26, 2017 at 4:32 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Oct 26, 2017 at 3:48 PM, Dave Chinner <david@fromorbit.com> wrote:
>> On Thu, Oct 26, 2017 at 02:30:22PM +0300, Amir Goldstein wrote:
>>> On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
>>> > From: Dave Chinner <dchinner@redhat.com>
>>> >
>>> > Now that we have persistent usable block counts, we need to be able
>>> > to change them. This allows us to control thin provisioned
>>> > filesystem space usage at the filesystem level, not the block device
>>> > level.
>>> >
>>> > If the grow operation grows the usable space beyond the current
>>> > LBA size of the filesystem, then we also need to physically grow the
>>> > filesystem to match the new size of the underlying device. Hence
>>> > grow behaves like it always has, expect for the fact that it wont'
>>> > grow physically until usable space would exceed the LBA size.
>>> >
>>> > Being able to modify usable space also allows us to shrink the
>>> > filesystem on thin devices as easily as growing it. We simply reduce
>>> > the usable space and the free space, and we're done. The user then
>>> > needs to run a fstrim pass to ensure all the unused space in the
>>> > filesystem LBA is marked as unused by the underlying device. No data
>>> > or metadata movement is required as the underlying LBA space has not
>>> > changed.
>>> >
>>> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
>>>
>>> With this change, behavior of userspace program that tried to shrink filesystem
>>> size will change from -EINVAL to success for filesystems that were configured
>>> to allow that. But unmodified userspace program may still be caught by surprise
>>> from this success return code that was never excersized in the past.
>>
>> What userspace program would be trying to shrink XFS filesystems
>> that doesn't already handle grow operations from the same ioctl call
>> returning success? Hell, I'd like to know what app is even trying to
>> shrink XFS filesystems...
>>
>
> A buggy program of course ;-)
>
>>> I have also argued elsewhere that the fact that the request to shrink the
>>> "virtual" size vs. physical size is implicit and not explicit, that would hinder
>>> future attempts to use the API as it was intended to implement physical shrink.
>>
>> No, feature bits decide the action to take without any ambiguity.
>>
>
> The ambiguity I was referring to was of the user program's intent.
> Did it request to the set filesystem size to N or filesystem usable size to N.
> When growing, the difference in intent doesn't change the result.

I take that back. I took a look at xfs_growfs.
If the user requests to change only imaxpct, xfs_growfs will set newblocks
to geo.datablocks, so the user intent of "not changing data blocks"
when running an old version of xfs_growfs is handled by the current
patches as "the blue moon is shining - give me the promised disk space".
Unless I am misunderstanding and geo.datablocks means usable
data blocks in FSGEOMETRY V4?
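
To make the scenario concrete, the old-binary behaviour I'm describing
looks roughly like the sketch below (paraphrased, not the actual
xfs_growfs source; 'fd' and 'new_imaxpct' are placeholders, and header
names may differ):

        /* paraphrased sketch of an old xfs_growfs changing only imaxpct */
        #include <sys/ioctl.h>
        #include <xfs/xfs.h>

        static int change_imaxpct(int fd, struct xfs_fsop_geom *geo,
                                  int new_imaxpct)
        {
                struct xfs_growfs_data in = {
                        /* "keep the data section as it is" */
                        .newblocks = geo->datablocks,
                        .imaxpct   = new_imaxpct,
                };

                return ioctl(fd, XFS_IOC_FSGROWFSDATA, &in);
        }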

I don't see how you can get away without adding "my intent is to
modify usable blocks" to the API.
I hope you can prove me wrong, because the way you extended the
growfs API is very elegant.

Also, forgive me for not digging too deep myself to find the answer,
but is the existing imaxpct being recalculated based on the new
usable_dblocks, or is imaxpct still relative to the total dblocks?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
                   ` (14 preceding siblings ...)
  2017-10-26 11:09 ` [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Amir Goldstein
@ 2017-10-30 13:31 ` Brian Foster
  2017-10-30 21:09   ` Dave Chinner
  15 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2017-10-30 13:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> This patchset is aimed at filesystems that are installed on sparse
> block devices, a.k.a thin provisioned devices. The aim of the
> patchset is to bring the space management aspect of the storage
> stack up into the filesystem rather than keeping it below the
> filesystem where users and the filesystem have no clue they are
> about to run out of space.
> 

Just a few high level comments/thoughts after a very quick pass..

...
> The patchset has several parts to it. It is built on a 4.14-rc5
> kernel with for-next and Darrick's scrub tree from a couple of days
> ago merged into it.
> 
> The first part of teh series is a growfs refactoring. This can
> probably stand alone, and the idea is to move the refactored
> infrastructure into libxfs so it can be shared with mkfs. This also
> cleans up a lot of the cruft in growfs and so makes it much easier
> to add the changes later in the series.
> 

It'd be nice to see this as a separate patchset. It does indeed seem
like it could be applied before all of the other bits are worked out.

> The second part of the patchset moves the functionality of
> sb_dblocks into the struct xfs_mount. This provides the separation
> of address space checks and capacty related calculations that the
> thinspace mods require. This also fixes the problem of freshly made,
> empty filesystems reporting 2% of the space as used.
> 

This all seems mostly sane to me. FWIW, the in-core fields (particularly
->m_LBA_size) strike me as oddly named. "LBA size" strikes me as the
addressable range of the underlying device, regardless of fs size (but I
understand that's not how it's used here). If we have ->m_usable_blocks,
why not name the other something like ->m_total_blocks or just
->m_blocks? That suggests to me that the fields are related, mutable,
accounted in units of blocks, etc. Just a nit, though.

That aside, it also seems like this part could technically stand as an
independent change. 

> The XFS_IOC_FSGEOMETRY ioctl needed to be bumped to a new version
> because the structure needed growing.
> 
> Finally, there's the patches that provide thinspace support and the
> growfs mods needed to grow and shrink.
> 

I still think there's a gap related to fs and bdev block size that needs
to be addressed with regard to preventing pool depletion, but I think
that can ultimately be handled with documentation. I also wonder a bit
about consistency between online modifications to usable blocks and fs
blocks that might have been allocated but not yet written (i.e.,
unwritten extents and the log on freshly created filesystems). IOW,
shrinking the filesystem in such cases may not limit pool allocation in
the way a user might expect.

Did we ever end up with support for bdev fallocate? If so, I wonder if
for example it would be useful to fallocate the physical log at mount or
mkfs time for thin enabled fs'. The unwritten extent case may just be a
matter of administration sanity since the filesystem shrink will
consider those as allocated blocks and thus imply that the fs expects
the ability to write them.
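
For illustration, "bdev fallocate" from userspace would look something
like the sketch below (illustrative only; the device path and log
offsets are made up, and whether this actually provisions pool space
depends on how the underlying thin device implements zero-range):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <linux/falloc.h>
        #include <stdio.h>
        #include <unistd.h>

        /* zero-range a (hypothetical) log extent on the thin block device */
        static int provision_range(const char *dev, off_t start, off_t len)
        {
                int fd = open(dev, O_RDWR);
                int ret;

                if (fd < 0)
                        return -1;
                ret = fallocate(fd, FALLOC_FL_ZERO_RANGE, start, len);
                if (ret < 0)
                        perror("fallocate");
                close(fd);
                return ret;
        }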

Finally, I tend to agree with Amir's comment with regard to
shrink/growfs... at least insofar as I understand his concern. If we do
support physical shrink in the future, what do we expect the interface
to look like in light of this change? FWIW, I also think there's an
element of design/interface consistency to the argument (aside from the
concern over the future of the physical shrink api). We've separated
total blocks from usable blocks in other userspace interfaces
(geometry). Not doing so for growfs is somewhat inconsistent, and also
creates confusion over the meaning of newblocks in different contexts.

Given that this already requires on-disk changes and the addition of a
feature bit, it seems prudent to me to update the growfs API
accordingly. Isn't a growfs new_usable_blocks field or some such all we
really need to address that concern?

Brian

> I've smoke tested the non-thinspace code paths (running auto tests
> on a scrub enabled kernel+userspace right now) as I haven't updated
> the userspace code to exercise the thinp code paths yet. I know the
> concept works, but my userspace code has an older on-disk format
> from the prototype so it will take me a couple of days to update and
> work out how to get fstests to integrate it reliably. So this is
> mainly a heads-up RFC patchset....
> 
> Comments, thoughts, flames all welcome....
> 
> Cheers,
> 
> Dave.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-10-30 13:31 ` Brian Foster
@ 2017-10-30 21:09   ` Dave Chinner
  2017-10-31  4:49     ` Amir Goldstein
  2017-10-31 11:24     ` Brian Foster
  0 siblings, 2 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-30 21:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> > This patchset is aimed at filesystems that are installed on sparse
> > block devices, a.k.a thin provisioned devices. The aim of the
> > patchset is to bring the space management aspect of the storage
> > stack up into the filesystem rather than keeping it below the
> > filesystem where users and the filesystem have no clue they are
> > about to run out of space.
> > 
> 
> Just a few high level comments/thoughts after a very quick pass..
> 
> ...
> > The patchset has several parts to it. It is built on a 4.14-rc5
> > kernel with for-next and Darrick's scrub tree from a couple of days
> > ago merged into it.
> > 
> > The first part of teh series is a growfs refactoring. This can
> > probably stand alone, and the idea is to move the refactored
> > infrastructure into libxfs so it can be shared with mkfs. This also
> > cleans up a lot of the cruft in growfs and so makes it much easier
> > to add the changes later in the series.
> > 
> 
> It'd be nice to see this as a separate patchset. It does indeed seem
> like it could be applied before all of the other bits are worked out.

Yes, I plan to do that, and once it is separated, also get it into shape
where it can be shared with mkfs via libxfs.

> > The second part of the patchset moves the functionality of
> > sb_dblocks into the struct xfs_mount. This provides the separation
> > of address space checks and capacty related calculations that the
> > thinspace mods require. This also fixes the problem of freshly made,
> > empty filesystems reporting 2% of the space as used.
> > 
> 
> This all seems mostly sane to me. FWIW, the in-core fields (particularly
> ->m_LBA_size) strike me as oddly named. "LBA size" strikes me as the
> addressable range of the underlying device, regardless of fs size (but I
> understand that's not how it's used here). If we have ->m_usable_blocks,
> why not name the other something like ->m_total_blocks or just
> ->m_blocks? That suggests to me that the fields are related, mutable,
> accounted in units of blocks, etc. Just a nit, though.

Yeah, I couldn't think of a better pair of names - the point I was
trying to get across is that what we call "dblocks" is actually an
assumption about the structure of the block device address space
we are working on. i.e. that it's a contiguous, linear address space
from 0 to "dblocks".

I get that "total_blocks" sounds better, but to me that's a capacity
measurement, not an indication of the size of the underlying address
space the block device has provided. m_usable_blocks is obviously a
capacity measurement, but I was trying to convey that m_LBA_size is
not a capacity measurement but an externally imposed addressing
limit.

<shrug>

I guess if I document it well enough m_total_blocks will work.
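
To make the distinction concrete, the comments on the two in-core
fields would read something like this (illustrative only, not the
actual patch; every other mount field is elided):

        struct xfs_mount {
                /*
                 * Size of the block device address space the on-disk
                 * geometry spans. An externally imposed addressing
                 * limit, not a capacity measurement.
                 */
                xfs_rfsblock_t          m_LBA_size;

                /*
                 * Space users are allowed to consume - the capacity we
                 * report and enforce. Always <= m_LBA_size on a thin
                 * filesystem.
                 */
                xfs_rfsblock_t          m_usable_blocks;

                /* ... everything else elided ... */
        };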


> That aside, it also seems like this part could technically stand as an
> independent change. 

*nod*

> > The XFS_IOC_FSGEOMETRY ioctl needed to be bumped to a new version
> > because the structure needed growing.
> > 
> > Finally, there's the patches that provide thinspace support and the
> > growfs mods needed to grow and shrink.
> > 
> 
> I still think there's a gap related to fs and bdev block size that needs
> to be addressed with regard to preventing pool depletion, but I think
> that can ultimately be handled with documentation. I also wonder a bit
> about consistency between online modifications to usable blocks and fs
> blocks that might have been allocated but not yet written (i.e.,
> unwritten extents and the log on freshly created filesystems). IOW,
> shrinking the filesystem in such cases may not limit pool allocation in
> the way a user might expect.
> 
> Did we ever end up with support for bdev fallocate?

Yes, we do have that.

> If so, I wonder if
> for example it would be useful to fallocate the physical log at mount or
> mkfs time for thin enabled fs'.

We zero the log and stamp it with cycle headers, so it will already
be fully allocated in the underlying device after running mkfs.

> The unwritten extent case may just be a
> matter of administration sanity since the filesystem shrink will
> consider those as allocated blocks and thus imply that the fs expects
> the ability to write them.

Yes, that is correct. To make things work as expected, we'll need
to propagate fallocate calls to the block device from the
filesystem when mounted. There's a bunch of stuff there that we can
add once the core "this is a thin filesystem" awareness has been
added.

> Finally, I tend to agree with Amir's comment with regard to
> shrink/growfs... at least infosar as I understand his concern. If we do
> support physical shrink in the future, what do we expect the interface
> to look like in light of this change?

I don't expect it to look any different. It's exactly the same as
growfs - thinspace filesystems will simply do a logical grow/shrink,
fat filesystems will need to do a physical grow/shrink by
adding/removing AGs.

I suspect Amir is worried about the fact that I put "LBA_size"
in geom.datablocks instead of "usable_space" for thin-aware
filesystems (i.e. I just screwed up writing the new patch). Like I
said, I haven't updated the userspace stuff yet, so the thinspace
side of that hasn't been tested yet. If I screwed up xfs_growfs (and
I have, because some of the tests are reporting incorrect
post-grow sizes on fat filesystems) I tend to find out as soon as I
run it.

Right now I think using m_LBA_size and m_usable_space in the geom
structure was a mistake - they should remain the superblock values
because otherwise the hidden metadata reservations can affect what
is reported to userspace, and that's where I think the test failures
are coming from....


> FWIW, I also think there's an
> element of design/interface consistency to the argument (aside from the
> concern over the future of the physical shrink api). We've separated
> total blocks from usable blocks in other userspace interfaces
> (geometry). Not doing so for growfs is somewhat inconsistent, and also
> creates confusion over the meaning of newblocks in different contexts.

It has to be done for the geometry info so that xfs_info can report
the thinspace size of the filesystem in addition to the physical
size. It does not need to be done to make growfs work correctly.

> Given that this already requires on-disk changes and the addition of a
> feature bit, it seems prudent to me to update the growfs API
> accordingly. Isn't a growfs new_usable_blocks field or some such all we
> really need to address that concern?

I really do not see any reason for changing the growfs interface
right now. If there's a problem in future that physical shrink
introduces, we can rev the interface when the problem arises.

Cheers

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-10-30 21:09   ` Dave Chinner
@ 2017-10-31  4:49     ` Amir Goldstein
  2017-10-31 22:40       ` Dave Chinner
  2017-10-31 11:24     ` Brian Foster
  1 sibling, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2017-10-31  4:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Mon, Oct 30, 2017 at 11:09 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
>> On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
...
>> Finally, I tend to agree with Amir's comment with regard to
>> shrink/growfs... at least infosar as I understand his concern. If we do
>> support physical shrink in the future, what do we expect the interface
>> to look like in light of this change?
>
> I don't expect it to look any different. It's exactly the same as
> growfs - thinspace filesystem will simply do a logical grow/shrink,
> fat filesystems will need to do a physical grow/shrink
> adding/removing AGs.
>
> I suspect Amir is worried about the fact that I put "LBA_size"
> in geom.datablocks instead of "usable_space" for thin-aware
> filesystems (i.e. I just screwed up writing the new patch). Like I
> said, I haven't updated the userspace stuff yet, so the thinspace
> side of that hasn't been tested yet. If I screwed up xfs_growfs (and
> I have because some of the tests are reporting incorrect
> post-grow sizes on fat filesytsems) I tend to find out as soon as I
> run it.
>
> Right now I think using m_LBA_size and m_usable_space in the geom
> structure was a mistake - they should remain the superblock values
> because otherwise the hidden metadata reservations can affect what
> is reported to userspace, and that's where I think the test failures
> are coming from....
>

I see. I suppose you intend to expose m_LBA_size in a new V5 geom value.
(geom.LBA_blocks?)
Does it make sense to expose the underlying bdev size in the same V5 geom
value for fat fs?
Does it make sense to expose yet another geom value for "total_blocks"?

The interpretation of former geom.datablocks will be "dblocks soft limit"
The interpretation of new geom.LBA_blocks will be "dblocks hard limit"
The interpretation of existing growfs will be "increase dblock soft limit",
but only up to dblocks hard limit.
This interpretation would be consistent for both thin and fat fs.

A future API for physical shrink/grow can be deployed to change
"dblocks hard limit", which may involve communicating with blockdev
(e.g. LVM) via standard interface (i.e. truncate()/fallocate()) to shrink
or grow it if volume is fat and to allocate/punch it if volume is thin.

>
>> FWIW, I also think there's an
>> element of design/interface consistency to the argument (aside from the
>> concern over the future of the physical shrink api). We've separated
>> total blocks from usable blocks in other userspace interfaces
>> (geometry). Not doing so for growfs is somewhat inconsistent, and also
>> creates confusion over the meaning of newblocks in different contexts.
>
> It has to be done for the geometry info so that xfs_info can report
> the thinspace size of the filesytem in addition to the physical
> size. It does not need to be done to make growfs work correctly.
>
>> Given that this already requires on-disk changes and the addition of a
>> feature bit, it seems prudent to me to update the growfs API
>> accordingly. Isn't a growfs new_usable_blocks field or some such all we
>> really need to address that concern?
>
> I really do not see any reason for changing the growfs interface
> right now. If there's a problem in future that physical shrink
> introduces, we can rev the interface when the problem arises.
>

At the moment, I don't see a problem either.
I just feel like there may be opportunities to improve fs/volume management
integration for fat fs/volumes as well, so we need to keep them in mind when
designing the new APIs.

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-10-30 21:09   ` Dave Chinner
  2017-10-31  4:49     ` Amir Goldstein
@ 2017-10-31 11:24     ` Brian Foster
  2017-11-01  0:45       ` Dave Chinner
  1 sibling, 1 reply; 47+ messages in thread
From: Brian Foster @ 2017-10-31 11:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Oct 31, 2017 at 08:09:41AM +1100, Dave Chinner wrote:
> On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> > On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> > > This patchset is aimed at filesystems that are installed on sparse
> > > block devices, a.k.a thin provisioned devices. The aim of the
> > > patchset is to bring the space management aspect of the storage
> > > stack up into the filesystem rather than keeping it below the
> > > filesystem where users and the filesystem have no clue they are
> > > about to run out of space.
> > > 
> > 
> > Just a few high level comments/thoughts after a very quick pass..
> > 
> > ...
> > > The patchset has several parts to it. It is built on a 4.14-rc5
> > > kernel with for-next and Darrick's scrub tree from a couple of days
> > > ago merged into it.
> > > 
> > > The first part of teh series is a growfs refactoring. This can
> > > probably stand alone, and the idea is to move the refactored
> > > infrastructure into libxfs so it can be shared with mkfs. This also
> > > cleans up a lot of the cruft in growfs and so makes it much easier
> > > to add the changes later in the series.
> > > 
> > 
> > It'd be nice to see this as a separate patchset. It does indeed seem
> > like it could be applied before all of the other bits are worked out.
> 
> Yes, i plan to do that, and with it separated also get it into shape
> where it can be shared with mkfs via libxfs.
> 
> > > The second part of the patchset moves the functionality of
> > > sb_dblocks into the struct xfs_mount. This provides the separation
> > > of address space checks and capacty related calculations that the
> > > thinspace mods require. This also fixes the problem of freshly made,
> > > empty filesystems reporting 2% of the space as used.
> > > 
> > 
> > This all seems mostly sane to me. FWIW, the in-core fields (particularly
> > ->m_LBA_size) strike me as oddly named. "LBA size" strikes me as the
> > addressable range of the underlying device, regardless of fs size (but I
> > understand that's not how it's used here). If we have ->m_usable_blocks,
> > why not name the other something like ->m_total_blocks or just
> > ->m_blocks? That suggests to me that the fields are related, mutable,
> > accounted in units of blocks, etc. Just a nit, though.
> 
> Yeah, I couldn't think of a better pair of names - the point I was
> trying to get across is that what we call "dblocks" is actually an
> assumption about the structure of the block device address space
> we are working on. i.e. that it's a contiguous, linear address space
> from 0 to "dblocks".
> 
> I get that "total_blocks" sounds better, but to me that's a capacity
> measurement, not an indication of the size of the underlying address
> space the block device has provided. m_usable_blocks is obviously a
> capacity measurement, but I was trying to convey that m_LBA_size is
> not a capacity measurement but an externally imposed addressing
> limit.
> 
> <shrug>
> 
> I guess if I document it well enough m_total_blocks will work.
> 

Hmm, yeah I see what you mean. Unfortunately I can't really think of
anything aside from m_total_blocks or perhaps m_phys_blocks at the
moment.

> 
> > That aside, it also seems like this part could technically stand as an
> > independent change. 
> 
> *nod*
> 
> > > The XFS_IOC_FSGEOMETRY ioctl needed to be bumped to a new version
> > > because the structure needed growing.
> > > 
> > > Finally, there's the patches that provide thinspace support and the
> > > growfs mods needed to grow and shrink.
> > > 
> > 
> > I still think there's a gap related to fs and bdev block size that needs
> > to be addressed with regard to preventing pool depletion, but I think
> > that can ultimately be handled with documentation. I also wonder a bit
> > about consistency between online modifications to usable blocks and fs
> > blocks that might have been allocated but not yet written (i.e.,
> > unwritten extents and the log on freshly created filesystems). IOW,
> > shrinking the filesystem in such cases may not limit pool allocation in
> > the way a user might expect.
> > 
> > Did we ever end up with support for bdev fallocate?
> 
> Yes, we do have that.
> 
> > If so, I wonder if
> > for example it would be useful to fallocate the physical log at mount or
> > mkfs time for thin enabled fs'.
> 
> We zero the log and stamp it with cycle headers, so it will already
> be fully allocated in th eunderlying device after running mkfs.
> 

Ok. Taking a quick look... I see that we do zero the log, but note that
we don't stamp the entire log in the mkfs case. It's enough to set the
LSN of a single record for the kernel to detect the log has been reset.

Either way, that probably means the log is allocated or the thin device
doesn't survive mkfs. I still think it might be wise to fallocate at
mount time via that log reset/zeroed case to ensure we catch any corner
cases or landmines in the future, fwiw, but not critical.

> > The unwritten extent case may just be a
> > matter of administration sanity since the filesystem shrink will
> > consider those as allocated blocks and thus imply that the fs expects
> > the ability to write them.
> 
> Yes, that is correct. To make things work as expected, we'll need
> to propagating fallocate calls to the block device from the
> filesystem when mounted. There's a bunch of stuff there that we can
> add once the core "this is a thin filesystem" awareness has been
> added.
> 

*nod*

> > Finally, I tend to agree with Amir's comment with regard to
> > shrink/growfs... at least infosar as I understand his concern. If we do
> > support physical shrink in the future, what do we expect the interface
> > to look like in light of this change?
> 
> I don't expect it to look any different. It's exactly the same as
> growfs - thinspace filesystem will simply do a logical grow/shrink,
> fat filesystems will need to do a physical grow/shrink
> adding/removing AGs.
> 

How would you physically shrink a thin filesystem?

> I suspect Amir is worried about the fact that I put "LBA_size"
> in geom.datablocks instead of "usable_space" for thin-aware
> filesystems (i.e. I just screwed up writing the new patch). Like I
> said, I haven't updated the userspace stuff yet, so the thinspace
> side of that hasn't been tested yet. If I screwed up xfs_growfs (and
> I have because some of the tests are reporting incorrect
> post-grow sizes on fat filesytsems) I tend to find out as soon as I
> run it.
> 
> Right now I think using m_LBA_size and m_usable_space in the geom
> structure was a mistake - they should remain the superblock values
> because otherwise the hidden metadata reservations can affect what
> is reported to userspace, and that's where I think the test failures
> are coming from....
> 
> 
> > FWIW, I also think there's an
> > element of design/interface consistency to the argument (aside from the
> > concern over the future of the physical shrink api). We've separated
> > total blocks from usable blocks in other userspace interfaces
> > (geometry). Not doing so for growfs is somewhat inconsistent, and also
> > creates confusion over the meaning of newblocks in different contexts.
> 
> It has to be done for the geometry info so that xfs_info can report
> the thinspace size of the filesytem in addition to the physical
> size. It does not need to be done to make growfs work correctly.
> 

It's not a question of needing it to work correctly. It's a question of
design/interface consistency. Perhaps this loses relevance after your
changes noted above..

> > Given that this already requires on-disk changes and the addition of a
> > feature bit, it seems prudent to me to update the growfs API
> > accordingly. Isn't a growfs new_usable_blocks field or some such all we
> > really need to address that concern?
> 
> I really do not see any reason for changing the growfs interface
> right now. If there's a problem in future that physical shrink
> introduces, we can rev the interface when the problem arises.
> 

Indeed. I'm not terribly concerned about the growfs interface, but in
the interest of trying to make sure the point is clear... does that mean
we'd add a new usable_blocks field to the growfs structure if/when
physical shrink is supported? If so, doesn't that mean we'd have two
fields that are used to logically shrink the fs, depending on interface
version, or that we'd have some kernels that logically shrink based on
'newblocks' and some that don't?

The larger question I have is what do we save by not just adding
usable_blocks to growfs_data now so the interface remains
self-explanatory? That seems relatively minor with respect to the
on-disk changes.

Brian

> Cheers
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-10-31  4:49     ` Amir Goldstein
@ 2017-10-31 22:40       ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2017-10-31 22:40 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Brian Foster, linux-xfs

On Tue, Oct 31, 2017 at 06:49:01AM +0200, Amir Goldstein wrote:
> On Mon, Oct 30, 2017 at 11:09 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> >> On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> ...
> >> Finally, I tend to agree with Amir's comment with regard to
> >> shrink/growfs... at least infosar as I understand his concern. If we do
> >> support physical shrink in the future, what do we expect the interface
> >> to look like in light of this change?
> >
> > I don't expect it to look any different. It's exactly the same as
> > growfs - thinspace filesystem will simply do a logical grow/shrink,
> > fat filesystems will need to do a physical grow/shrink
> > adding/removing AGs.
> >
> > I suspect Amir is worried about the fact that I put "LBA_size"
> > in geom.datablocks instead of "usable_space" for thin-aware
> > filesystems (i.e. I just screwed up writing the new patch). Like I
> > said, I haven't updated the userspace stuff yet, so the thinspace
> > side of that hasn't been tested yet. If I screwed up xfs_growfs (and
> > I have because some of the tests are reporting incorrect
> > post-grow sizes on fat filesytsems) I tend to find out as soon as I
> > run it.
> >
> > Right now I think using m_LBA_size and m_usable_space in the geom
> > structure was a mistake - they should remain the superblock values
> > because otherwise the hidden metadata reservations can affect what
> > is reported to userspace, and that's where I think the test failures
> > are coming from....
> >
> 
> I see. I suppose you intend to expose m_LBA_size in a new V5 geom value.
> (geom.LBA_blocks?)
> Does it make sense to expose the underlying bdev size in the same V5 geom
> value for fat fs?
> Does it make sense to expose yet another geom value for "total_blocks"?

Yes, yes and yes.

> The interpretation of former geom.datablocks will be "dblocks soft limit"

No. It is unchanged in meaning: "size of filesystem".

> The interpretation of new geom.LBA_blocks will be "dblocks hard limit"

No, it's not a hard limit. It's the "size of the filesystem on-disk
geometry".

> The interpretation of existing growfs will be "increase dblock soft limit",
> but only up to dblocks hard limit.

No. The interpretation is exactly as it is now: "grow filesystem
size to N", and the kernel then determines what combination of
logical (thin) grow and physical (i.e. modifying SB/AG geometry)
grow is required.

> This interpretation would be consistent for both thin and fat fs.

Which is exactly what I'm already providing, without trying to
redefine the interface or presenting an unnecessary difference in
grow/shrink behaviour to users.

> A future API for physical shrink/grow can be deployed to change
> "dblocks hard limit", which may involve communicating with blockdev
> (e.g. LVM) via standard interface (i.e. truncate()/fallocate()) to shrink
> or grow it if volume is fat and to allocate/punch it if volume is thin.

We've already got "physical grow" - xfs_growfs queries the block
device for its size, and passes that to the kernel to physically
grow the fs to that size. We don't need the in-kernel grow
implementation to do this right now.

If you want to add dynamic block device size controls in the
filesystem grow/shrink kernel implementation, then start by
providing the fs/block device implementation to let filesystems
implement that. Then we can worry about how to present that to
userspace through the filesystem, because we're going to have to
completely rethink the way grow/shrink operations are managed and
controlled from an admin perspective.

This is not the time or place to be overcomplicating a simple
extension to an existing filesystem operation that is completely
transparent to existing users.


> I just feel like there may be opportunities to improve fs/volume management
> integration for fat fs/volumes as well, so we need to keep them in mind when
> designing the new APIs.

I think you're misunderstanding my intentions here: I'm not
designing a new fs/volume management API.  In fact, I'm working in
the opposite direction. I want to get rid of the need to integrate
filesystem and volume manager functionality so that things like thin
provisioning work sanely.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-10-31 11:24     ` Brian Foster
@ 2017-11-01  0:45       ` Dave Chinner
  2017-11-01 14:17         ` Brian Foster
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2017-11-01  0:45 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Oct 31, 2017 at 07:24:32AM -0400, Brian Foster wrote:
> On Tue, Oct 31, 2017 at 08:09:41AM +1100, Dave Chinner wrote:
> > On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> > > On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> > > > This patchset is aimed at filesystems that are installed on sparse
> > > > block devices, a.k.a thin provisioned devices. The aim of the
> > > > patchset is to bring the space management aspect of the storage
> > > > stack up into the filesystem rather than keeping it below the
> > > > filesystem where users and the filesystem have no clue they are
> > > > about to run out of space.
....
> > I get that "total_blocks" sounds better, but to me that's a capacity
> > measurement, not an indication of the size of the underlying address
> > space the block device has provided. m_usable_blocks is obviously a
> > capacity measurement, but I was trying to convey that m_LBA_size is
> > not a capacity measurement but an externally imposed addressing
> > limit.
> > 
> > <shrug>
> > 
> > I guess if I document it well enough m_total_blocks will work.
> > 
> 
> Hmm, yeah I see what you mean. Unfortunately I can't really think of
> anything aside from m_total_blocks or perhaps m_phys_blocks at the
> moment.

m_phys_blocks seems closer to the intent. If that's acceptable I'll
change the code to that.

> > > Finally, I tend to agree with Amir's comment with regard to
> > > shrink/growfs... at least infosar as I understand his concern. If we do
> > > support physical shrink in the future, what do we expect the interface
> > > to look like in light of this change?
> > 
> > I don't expect it to look any different. It's exactly the same as
> > growfs - thinspace filesystem will simply do a logical grow/shrink,
> > fat filesystems will need to do a physical grow/shrink
> > adding/removing AGs.
> > 
> 
> How would you physically shrink a thin filesystem?

You wouldn't. There should never be a need to do this because a
thinspace shrink doesn't actually free any space - it's just a usage
limit. fstrim is what actually shrinks the storage space used,
regardless of the current maximum capacity of the thin filesystem.

So I don't really see any point in physically shrinking a thin
filesystem because that defeats the entire purpose of having a thin
filesystem in the first place.

> > > Given that this already requires on-disk changes and the addition of a
> > > feature bit, it seems prudent to me to update the growfs API
> > > accordingly. Isn't a growfs new_usable_blocks field or some such all we
> > > really need to address that concern?
> > 
> > I really do not see any reason for changing the growfs interface
> > right now. If there's a problem in future that physical shrink
> > introduces, we can rev the interface when the problem arises.
> > 
> 
> Indeed. I'm not terribly concerned about the growfs interface, but in
> the interest of trying to make sure the point is clear... does that mean
> we'd add a new usable_blocks field to the growfs structure if/when
> physical shrink is supported?

No. growfs simply says "grow/shrink to size X" and the kernel then
determines if a physical or thin grow/shrink operation is performed
and does the work/checks appropriate to make that work.
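
Roughly, the in-kernel dispatch looks like the sketch below
(illustrative only - the helper and feature-bit names are made up, the
mount fields are the in-core ones discussed above):

        static int growfs_data(struct xfs_mount *mp, uint64_t newblocks)
        {
                /* fat filesystems: physical grow only, as it is today */
                if (!xfs_sb_version_hasthin(&mp->m_sb))
                        return physical_growfs(mp, newblocks);

                /*
                 * Thin filesystem: physically grow the geometry only if
                 * the requested usable space exceeds the current address
                 * space of the on-disk geometry.
                 */
                if (newblocks > mp->m_phys_blocks) {
                        int error = physical_growfs(mp, newblocks);
                        if (error)
                                return error;
                }

                /* the rest is just an accounting change */
                mp->m_usable_blocks = newblocks;
                return 0;
        }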

The only thing we need to ensure is that growfs has the necessary
info to correctly set the size of the filesystem when it is doing
operations that don't want to change the size of the filesystem
(e.g. changing imaxpct).  i.e. an old growfs should be able to
grow/shrink a thin filesystem correctly without being aware it is a
thin filesystem.

> If so, doesn't that mean we'd have two
> fields that are used to logically shrink the fs, depending on interface
> version, or that we'd have some kernels that logically shrink based on
> 'newblocks' and some that don't?

No.

> The larger question I have is what do we save by not just adding
> usable_blocks to growfs_data now so the interface remains
> self-explanatory? That seems relatively minor with respect to the
> on-disk changes.

We avoid having to modify an API for no real reason. We don't change
userspace APIs unless we have an actual need, and I'm not seeing
anyone articulate an actual need to change the growfs ioctl.

Further, changing the growfs API for thinspace filesystems means
that old growfs binaries will do bad things on thinspace filesystems
because growfs doesn't do version/feature checks. Hence we have to
make thinspace grow operations work sanely for existing
XFS_IOC_FSGEOMETRY_V4 callers, and that means we can't rely on a new
XFS_IOC_GROWFSDATA ioctl just to operate on thinspace filesystems.

*If* we add physical shrink support we can rev the growfs interface
at that point in time *if we need to*. But I'm simply not convinced
we will need to change it at all...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-01  0:45       ` Dave Chinner
@ 2017-11-01 14:17         ` Brian Foster
  2017-11-01 23:53           ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2017-11-01 14:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Nov 01, 2017 at 11:45:13AM +1100, Dave Chinner wrote:
> On Tue, Oct 31, 2017 at 07:24:32AM -0400, Brian Foster wrote:
> > On Tue, Oct 31, 2017 at 08:09:41AM +1100, Dave Chinner wrote:
> > > On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> > > > On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> > > > > This patchset is aimed at filesystems that are installed on sparse
> > > > > block devices, a.k.a thin provisioned devices. The aim of the
> > > > > patchset is to bring the space management aspect of the storage
> > > > > stack up into the filesystem rather than keeping it below the
> > > > > filesystem where users and the filesystem have no clue they are
> > > > > about to run out of space.
> ....
> > > I get that "total_blocks" sounds better, but to me that's a capacity
> > > measurement, not an indication of the size of the underlying address
> > > space the block device has provided. m_usable_blocks is obviously a
> > > capacity measurement, but I was trying to convey that m_LBA_size is
> > > not a capacity measurement but an externally imposed addressing
> > > limit.
> > > 
> > > <shrug>
> > > 
> > > I guess if I document it well enough m_total_blocks will work.
> > > 
> > 
> > Hmm, yeah I see what you mean. Unfortunately I can't really think of
> > anything aside from m_total_blocks or perhaps m_phys_blocks at the
> > moment.
> 
> m_phys_blocks seems closer to the intent. If that's acceptible I'll
> change the code to that.
> 

That works for me, thanks..

BTW, was there ever any kind of solution to the metadata block
reservation issue in the thin case? We now hide metadata reservation
from the user via the m_usable_blocks account. If m_phys_blocks
represents a thin volume, how exactly do we prevent those metadata
allocations/writes from overrunning what the admin has specified as
"usable" with respect to the thin volume?

> > > > Finally, I tend to agree with Amir's comment with regard to
> > > > shrink/growfs... at least infosar as I understand his concern. If we do
> > > > support physical shrink in the future, what do we expect the interface
> > > > to look like in light of this change?
> > > 
> > > I don't expect it to look any different. It's exactly the same as
> > > growfs - thinspace filesystem will simply do a logical grow/shrink,
> > > fat filesystems will need to do a physical grow/shrink
> > > adding/removing AGs.
> > > 
> > 
> > How would you physically shrink a thin filesystem?
> 
> You wouldn't. There should never be a need to do this because it a
> thinspace shrink doesn't actually free any space - it's just a usage
> limit. fstrim is what actually shrinks the storage space used,
> regardless of the current maximum capcity of the thin filesystem.
> 

In other words, the answer is that we can't physically shrink a thin fs
because of a limitation on the growfs interface due to how we've used it
here. That's fine, I'd just prefer to be explicit about the behavior
rather than attempt to predict that certain operations will never have a
legitimate use case, however likely that might seem right now. More
importantly, then I can at least understand we're on the same page with
regard to the ramifications of this change and we can simply agree to
disagree. ;)

I do agree that the effect is essentially redundant on thin filesystems
from the perspective of administration of a thin pool. I don't really
have a specific use case in mind for why somebody would want to
physically shrink a thin enabled fs. From the thin management
perspective, it seems more likely to me that we'd find users who want to
physically grow a filesystem without logically growing it at the same
time (another limitation of the current interface) before we're in the
situation where physical shrink is a real concern.

My perspective on physical shrink is more that the functionality still
exists via the management of the thin volume itself, so if we someday
support physical shrink, somebody is bound to find a decent enough
reason to want to do it on a thin fs. Maybe they want to convert a thin
volume to a fat one or wanted to shrink the geometry to reduce metadata
overhead or something. Anyways, it seems a bit unfortunate that the only
reason they couldn't is due to too narrow an interface.

> So I don't really see any point in physically shrink a thin
> filesystem because that defeats the entire purpose of having a thin
> filesystem in the first place.
> 
> > > > Given that this already requires on-disk changes and the addition of a
> > > > feature bit, it seems prudent to me to update the growfs API
> > > > accordingly. Isn't a growfs new_usable_blocks field or some such all we
> > > > really need to address that concern?
> > > 
> > > I really do not see any reason for changing the growfs interface
> > > right now. If there's a problem in future that physical shrink
> > > introduces, we can rev the interface when the problem arises.
> > > 
> > 
> > Indeed. I'm not terribly concerned about the growfs interface, but in
> > the interest of trying to make sure the point is clear... does that mean
> > we'd add a new usable_blocks field to the growfs structure if/when
> > physical shrink is supported?
> 
> No. growfs simply says "grow/shrink to size X" and the kernel then
> determines if a physical or thin grow/shrink operation is performed
> and does the work/checks appropriate to make that work.
> 

I understand how the current change is intended to work... that doesn't
really answer the question. You acknowledge below that we could support
physical shrink on thin fs' by revving the growfs interface. I assumed
above that means the new interface would handle physical and usable
blocks separately. If that is not the case, then how would you propose
to define that new interface to handle both physical and logical shrink?

With that new/revved interface, would we not somehow have to maintain
backwards compatibility of the old interface with regard to thin enabled
filesystems?

> The only thing we need to ensure is that growfs has the necessary
> info to correctly set the size of the filesystem when it is doing
> operations that don't want to change the size of the filesystem
> (e.g. changing imaxpct).  i.e. an old growfs should be able to
> grow/shrink a thin filesystem correctly without being aware it is a
> thin filesystem.
> 
> > If so, doesn't that mean we'd have two
> > fields that are used to logically shrink the fs, depending on interface
> > version, or that we'd have some kernels that logically shrink based on
> > 'newblocks' and some that don't?
> 
> No.
> 
> > The larger question I have is what do we save by not just adding
> > usable_blocks to growfs_data now so the interface remains
> > self-explanatory? That seems relatively minor with respect to the
> > on-disk changes.
> 
> We avoid having to modify an API for no real reason. We don't change
> userspace APIs unless we have an actual need, and I'm not seeing
> anyone articulate an actual need to change the growfs ioctl.
> 

Understood. In this case, it seems to me we're changing the meaning of
an existing user interface to avoid changing the interface for a new
feature. IOW, what would the approach be here if we already supported
physical shrink?

> Further, changing the growfs API for thinspace filesystems means
> that old growfs binaries will do bad things on thinspace filesystems
> because growfs doesn't do version/feature checks. Hence we have to
> make thinspace grow operations work sanely for existing
> XFS_IOC_FSGEOMETRY_V4 callers, and that means we can't rely on a new
> XFS_IOC_GROWFSDATA ioctl just to operate on thinspace filesystems.
> 

Changing/revving the interface means the old interface continues to only
imply physical grow/shrink. That means the new interface is required to
manage logical grow/shrink, which is not that far off from other new
features we add that require updated userspace for support. We already
have to add a feature check to make the current interface perform a
logical shrink rather than an error, so I'm not following why that's
such a technical hurdle.

Brian

> *If* we add physical shrink support we can rev the growfs interface
> at that point in time *if we need to*. But I'm simply not convinced
> we will need to change it at all...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-10-26 12:35   ` Dave Chinner
@ 2017-11-01 22:31     ` Darrick J. Wong
  0 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2017-11-01 22:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Amir Goldstein, linux-xfs, linux-fsdevel

On Thu, Oct 26, 2017 at 11:35:48PM +1100, Dave Chinner wrote:
> On Thu, Oct 26, 2017 at 02:09:26PM +0300, Amir Goldstein wrote:
> > On Thu, Oct 26, 2017 at 11:33 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > This patchset is aimed at filesystems that are installed on sparse
> > > block devices, a.k.a thin provisioned devices. The aim of the
> > > patchset is to bring the space management aspect of the storage
> > > stack up into the filesystem rather than keeping it below the
> > > filesystem where users and the filesystem have no clue they are
> > > about to run out of space.
> .....
> > > I've smoke tested the non-thinspace code paths (running auto tests
> > > on a scrub enabled kernel+userspace right now) as I haven't updated
> > > the userspace code to exercise the thinp code paths yet. I know the
> > > concept works, but my userspace code has an older on-disk format
> > > from the prototype so it will take me a couple of days to update and
> > > work out how to get fstests to integrate it reliably. So this is
> > > mainly a heads-up RFC patchset....
> > >
> > > Comments, thoughts, flames all welcome....
> > >
> > 
> > This proposal is very interesting outside the scope of xfs, so I hope you
> > don't mind I've CC'ed fsdevel.
> > 
> > I am thinking how a slightly similar approach could be used to online shrink
> > the physical size for filesystems that are not on thin provisioned devices:
> > 
> > - Set/get a geometry variable of "agsoftlimit" (better names are welcome)
> >   which is <= agcount.
> > - agsoftlimit < agcount means that free space of AG > agsoftlimit is zero,
> >   so total disk space usage will not show this space as available user space.
> > - inode and block allocators will avoid dipping into the high AG pool,
> >   expect for metadata block needed for freeing high AG inodes/blocks.
> > - A variant of xfs_fsr (or e4defrag for that matter) could "migrate" inodes
> >   and/or blocks from high to low AGs.
> > - Migrating directories is quite different than migrating files, but doable.
> > - Finally, on XFS_IOC_FSGROWFSDATA, if shrinking filesystem size and
> >   high AG usage counters are zero, then physical size can be shrunk
> >   as down as agsoftlimit instead of reducing usable_blocks.
> 
> Yup, you've just described all the craziness that a physical shrink
> requires on XFS. Lots of new user APIs, new tools to move data
> around, new code to transparently migrate directories and other
> metadata (like xattrs), etc.
> 
> Also, the log is placed half way through the XFS filesystem, so
> unless we add code to allocate and switch to a new journal (in a
> crash safe and recoverable way!) we can't shrink by more than 50%.
> 
> Also, none of the growfs code touches existing AGs - they'll have to
> be scanned to determine they really are empty before they get
> removed from the filesystem, and then there's the other issues like
> we can't shrink to less than 2 AGs, which puts a significant minimum
> shrink size on filesystems (again there's that "shrink more than 50%
> requires a lot more work" problem for filesystems < 4TB).
> 
> And to do it efficiently, we really need rmap support in filesystems
> so the fs can tell us what files and metadata need to be moved,
> rather than having to do brute force scans to work out what needs
> moving. Especially as the brute force scans can't find all the
> metadata that we might need to relocate before we've emptied the
> space we need to stop using.
> 
> IOWs, it's a *lot* of work, and IMO there's more work in
> verification and proving that everything is crash safe, recoverable
> and restartable. We've known how much work it is for years - why do
> you think it hasn't been implemented? See:
> 
> http://xfs.org/index.php/Shrinking_Support
> 
> And:
> 
> http://xfs.org/index.php/Unfinished_work#The_xfs_reno_tool
> 
> And specifically follow the reference to a discussion in 2007:
> 
> https://marc.info/?l=linux-xfs&m=119131697224361&w=2
> 
> > With this, xfs can gain physical shrink support and ext4 can gain online
> > (and safe) shrink support.
> 
> Yes, I estimate it'll probably take about a man-year's worth of work
> to get xfs shrink to production ready from all the pieces we have
> sitting around today.

Ewww, physical shrink.  Maybe that becomes feasible after parent pointer
support lands, both from a "making the directory rewrite easier" and a
"do the reviewers have time for this?" perspective. :)

I've worked on bashing resize2fs into better shape for shrink support;
the things you have to do (even on ext4, which doesn't share extents) to
the fs are pretty awful.  Ideally you'd move whole extents (or just
defrag the file into the space that will be left) but once reflink comes
into play you /have/ to have a strategy for maintaining the sharedness
across the migration or else you run the risk of blowing up the space
usage.

That's a lot to review, even if the strategy is "bail out with ENOSPC
having potentially done a ton of work and/or fragmented the fs".

--D

> > Assuming that this idea is not shot down on sight, the only implication
> > I can think of w.r.t your current patches is leaving enough room in new APIs
> > to accomodate this prospect functionality.
> 
> I'm not introducing any new APIs. XFS_IOC_FSGROWFSDATA already
> supports shrinking and resizing/moving the log, they just aren't
> implemented.
> 
> > You have already reserved 15 u64 in geometry V5 ioctl struct, so that's good.
> > You have not changed XFS_IOC_FSGROWFSDATA at all, so going forward
> > the ambiguity of physical shrink vs. virtual shrink could either be determined
> > by heuristics
> 
> No heuristics at all. filesystems on thin devices will have a
> feature bit in the superblock indicating they are thin filesystems.
> If the "thinspace" bit is set, shrink is just an accounting
> operation. If it's not set, then it needs to physically change the
> geometry of the filesystem....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-01 14:17         ` Brian Foster
@ 2017-11-01 23:53           ` Dave Chinner
  2017-11-02 11:25             ` Brian Foster
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2017-11-01 23:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Nov 01, 2017 at 10:17:21AM -0400, Brian Foster wrote:
> On Wed, Nov 01, 2017 at 11:45:13AM +1100, Dave Chinner wrote:
> > On Tue, Oct 31, 2017 at 07:24:32AM -0400, Brian Foster wrote:
> > > On Tue, Oct 31, 2017 at 08:09:41AM +1100, Dave Chinner wrote:
> > > > On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> > > > > On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> > > > > > This patchset is aimed at filesystems that are installed on sparse
> > > > > > block devices, a.k.a thin provisioned devices. The aim of the
> > > > > > patchset is to bring the space management aspect of the storage
> > > > > > stack up into the filesystem rather than keeping it below the
> > > > > > filesystem where users and the filesystem have no clue they are
> > > > > > about to run out of space.
> > ....
> > > > I get that "total_blocks" sounds better, but to me that's a capacity
> > > > measurement, not an indication of the size of the underlying address
> > > > space the block device has provided. m_usable_blocks is obviously a
> > > > capacity measurement, but I was trying to convey that m_LBA_size is
> > > > not a capacity measurement but an externally imposed addressing
> > > > limit.
> > > > 
> > > > <shrug>
> > > > 
> > > > I guess if I document it well enough m_total_blocks will work.
> > > > 
> > > 
> > > Hmm, yeah I see what you mean. Unfortunately I can't really think of
> > > anything aside from m_total_blocks or perhaps m_phys_blocks at the
> > > moment.
> > 
> > m_phys_blocks seems closer to the intent. If that's acceptible I'll
> > change the code to that.
> > 
> 
> That works for me, thanks..
> 
> BTW, was there ever any kind of solution to the metadata block
> reservation issue in the thin case? We now hide metadata reservation
> from the user via the m_usable_blocks account. If m_phys_blocks
> represents a thin volume, how exactly do we prevent those metadata
> allocations/writes from overrunning what the admin has specified as
> "usable" with respect to the thin volume?

The reserved metadata blocks are not accounted from free space when
they are allocated - they are pulled from the reserved space that
has already been removed from the free space.

i.e. we can use as much or as little of the reserved space as we
want, but it doesn't affect the free/used space reported to
userspace at all.
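
In sketch form (illustrative names and structure only, not the
patchset code):

        #include <stdint.h>

        struct thin_space {
                uint64_t usable_blocks; /* capacity shown to users */
                uint64_t free_blocks;   /* free space within usable_blocks */
                uint64_t meta_reserve;  /* already removed from free_blocks */
        };

        /* what statfs-style reporting sees - the reservation is invisible */
        static uint64_t reported_free(const struct thin_space *ts)
        {
                return ts->free_blocks;
        }

        /* reserved metadata blocks come out of the pre-removed pool */
        static void alloc_reserved_metadata_block(struct thin_space *ts)
        {
                if (ts->meta_reserve)
                        ts->meta_reserve--; /* reported free space unchanged */
        }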

> > > > > Finally, I tend to agree with Amir's comment with regard to
> > > > > shrink/growfs... at least infosar as I understand his concern. If we do
> > > > > support physical shrink in the future, what do we expect the interface
> > > > > to look like in light of this change?
> > > > 
> > > > I don't expect it to look any different. It's exactly the same as
> > > > growfs - thinspace filesystem will simply do a logical grow/shrink,
> > > > fat filesystems will need to do a physical grow/shrink
> > > > adding/removing AGs.
> > > > 
> > > 
> > > How would you physically shrink a thin filesystem?
> > 
> > You wouldn't. There should never be a need to do this because it a
> > thinspace shrink doesn't actually free any space - it's just a usage
> > limit. fstrim is what actually shrinks the storage space used,
> > regardless of the current maximum capcity of the thin filesystem.
> 
> In other words, the answer is that we can't physically shrink a thin fs
> because of a limitation on the growfs interface due to how we've used it
> here.

No, that is not what I said. To paraphrase, what I said was "we
aren't going to support physically shrinking thin filesystems at
this point in time". That has nothing to do with the growfs API -
it's an implementation choice that reflects the fact we can't
physically shrink filesystems and that functionality is no closer to
being implemented than it was 10+ years ago.

i.e. we don't need to rev the interface to support shrink on thin
filesystems, so there's no need to rev the interface at this point
in time.

*If* we implement physical shrink, *then* we can rev the growfs
interface to allow users to run a physical shrink on thin
filesystems.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-01 23:53           ` Dave Chinner
@ 2017-11-02 11:25             ` Brian Foster
  2017-11-02 23:30               ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2017-11-02 11:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Nov 02, 2017 at 10:53:00AM +1100, Dave Chinner wrote:
> On Wed, Nov 01, 2017 at 10:17:21AM -0400, Brian Foster wrote:
> > On Wed, Nov 01, 2017 at 11:45:13AM +1100, Dave Chinner wrote:
> > > On Tue, Oct 31, 2017 at 07:24:32AM -0400, Brian Foster wrote:
> > > > On Tue, Oct 31, 2017 at 08:09:41AM +1100, Dave Chinner wrote:
> > > > > On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> > > > > > On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> > > > > > > This patchset is aimed at filesystems that are installed on sparse
> > > > > > > block devices, a.k.a thin provisioned devices. The aim of the
> > > > > > > patchset is to bring the space management aspect of the storage
> > > > > > > stack up into the filesystem rather than keeping it below the
> > > > > > > filesystem where users and the filesystem have no clue they are
> > > > > > > about to run out of space.
> > > ....
> > > > > I get that "total_blocks" sounds better, but to me that's a capacity
> > > > > measurement, not an indication of the size of the underlying address
> > > > > space the block device has provided. m_usable_blocks is obviously a
> > > > > capacity measurement, but I was trying to convey that m_LBA_size is
> > > > > not a capacity measurement but an externally imposed addressing
> > > > > limit.
> > > > > 
> > > > > <shrug>
> > > > > 
> > > > > I guess if I document it well enough m_total_blocks will work.
> > > > > 
> > > > 
> > > > Hmm, yeah I see what you mean. Unfortunately I can't really think of
> > > > anything aside from m_total_blocks or perhaps m_phys_blocks at the
> > > > moment.
> > > 
> > > m_phys_blocks seems closer to the intent. If that's acceptible I'll
> > > change the code to that.
> > > 
> > 
> > That works for me, thanks..
> > 
> > BTW, was there ever any kind of solution to the metadata block
> > reservation issue in the thin case? We now hide metadata reservation
> > from the user via the m_usable_blocks account. If m_phys_blocks
> > represents a thin volume, how exactly do we prevent those metadata
> > allocations/writes from overrunning what the admin has specified as
> > "usable" with respect to the thin volume?
> 
> The reserved metadata blocks are not accounted from free space when
> they are allocated - they are pulled from the reserved space that
> has already been removed from the free space.
> 

Ok, so the user can set a usable blocks value of something less than the
fs geometry, then the reservation is pulled from that, reducing the
reported "usable" value further. Hence, what ends up reported to the
user is actually something less than the value set by the user, which
means that the filesystem overall respects how much space the admin says
it can use in the underlying volume.

For example, the user creates a 100T thin volume with 10T of usable
space. The fs reserves a further 2T out of that for metadata, so then
what the user sees is 8T of writeable space. The filesystem itself
cannot use more than 10T out of the volume, as instructed. Am I
following that correctly? If so, that sounds reasonable to me from the
"don't overflow my thin volume" perspective.

> i.e. we can use as much or as little of the reserved space as we
> want, but it doesn't affect the free/used space reported to
> userspace at all.
> 
> > > > > > Finally, I tend to agree with Amir's comment with regard to
> > > > > > shrink/growfs... at least infosar as I understand his concern. If we do
> > > > > > support physical shrink in the future, what do we expect the interface
> > > > > > to look like in light of this change?
> > > > > 
> > > > > I don't expect it to look any different. It's exactly the same as
> > > > > growfs - thinspace filesystem will simply do a logical grow/shrink,
> > > > > fat filesystems will need to do a physical grow/shrink
> > > > > adding/removing AGs.
> > > > > 
> > > > 
> > > > How would you physically shrink a thin filesystem?
> > > 
> > > You wouldn't. There should never be a need to do this because it a
> > > thinspace shrink doesn't actually free any space - it's just a usage
> > > limit. fstrim is what actually shrinks the storage space used,
> > > regardless of the current maximum capcity of the thin filesystem.
> > 
> > In other words, the answer is that we can't physically shrink a thin fs
> > because of a limitation on the growfs interface due to how we've used it
> > here.
> 
> No, that is not what I said. To paraphrase, what I said was "we
> aren't going to support physically shrinking thin filesystems at
> this point in time". That has nothing to do with the growfs API -
> it's an implementation choice that reflects the fact we can't
> physically shrink filesystems and that functionality is no closer to
> being implemented than it was 10+ years ago.
> 

And I'm attempting to examine the ramifications of the decision to reuse
the physical shrink interface for logical shrink. IOW, we can decide
whether or not to allow physical shrink of a thin fs independent from
designing an interface that is capable of supporting it. Just like we
already have an interface that supports physical shrink, even though it
obviously doesn't work.

> i.e. we don't need to rev the interface to support shrink on thin
> filesystems, so there's no need to rev the interface at this point
> in time.
> 
> *If* we implement physical shrink, *then* we can rev the growfs
> interface to allow users to run a physical shrink on thin
> filesystems.
> 

Subsequently, pretty much the remainder of my last mail is based on the
following predicates:

- We've incorporated this change to use growfs->newblocks for logical
  shrink.
- We've implemented physical shrink.
- We've revved the growfs interface to support physical shrink on thin
  filesystems.

I'm not going to repeat all of the previous points... suffice it to say
that asserting that we only have to rev the interface if/when we support
physical shrink in response is a circular argument. I understand that
and I agree. I'm attempting to review how that would look due to the
implementation of this feature, particularly with respect to backwards
compatibility of the existing interface.

IOW, you're using the argument that we can rev the growfs interface in
response to the initial argument regarding the inability to
physically shrink a thin fs. As a result, I'm claiming that revving the
interface in the future for physical shrink may create more interface
clumsiness than it's worth compared to just revving it now for logical
shrink. In response to the points I attempt to make around that, you
argue above that we aren't any closer to physical shrink than we were 10
years ago and that we don't have to rev the interface unless we support
physical shrink. Round and round... ;P

The best I can read into the response here is that you think physical
shrink is unlikely enough to not need to care very much what kind of
interface confusion could result from needing to rev the current growfs
interface to support physical shrink on thin filesystems in the future.
Is that a fair assessment..?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-02 11:25             ` Brian Foster
@ 2017-11-02 23:30               ` Dave Chinner
  2017-11-03  2:47                 ` Darrick J. Wong
  2017-11-03 11:26                 ` Brian Foster
  0 siblings, 2 replies; 47+ messages in thread
From: Dave Chinner @ 2017-11-02 23:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Nov 02, 2017 at 07:25:33AM -0400, Brian Foster wrote:
> On Thu, Nov 02, 2017 at 10:53:00AM +1100, Dave Chinner wrote:
> > On Wed, Nov 01, 2017 at 10:17:21AM -0400, Brian Foster wrote:
> > > On Wed, Nov 01, 2017 at 11:45:13AM +1100, Dave Chinner wrote:
> > > > On Tue, Oct 31, 2017 at 07:24:32AM -0400, Brian Foster wrote:
> > > > > On Tue, Oct 31, 2017 at 08:09:41AM +1100, Dave Chinner wrote:
> > > > > > On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> > > > > > > On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> > > > > > > > This patchset is aimed at filesystems that are installed on sparse
> > > > > > > > block devices, a.k.a thin provisioned devices. The aim of the
> > > > > > > > patchset is to bring the space management aspect of the storage
> > > > > > > > stack up into the filesystem rather than keeping it below the
> > > > > > > > filesystem where users and the filesystem have no clue they are
> > > > > > > > about to run out of space.
> > > > ....
> > > > > > I get that "total_blocks" sounds better, but to me that's a capacity
> > > > > > measurement, not an indication of the size of the underlying address
> > > > > > space the block device has provided. m_usable_blocks is obviously a
> > > > > > capacity measurement, but I was trying to convey that m_LBA_size is
> > > > > > not a capacity measurement but an externally imposed addressing
> > > > > > limit.
> > > > > > 
> > > > > > <shrug>
> > > > > > 
> > > > > > I guess if I document it well enough m_total_blocks will work.
> > > > > > 
> > > > > 
> > > > > Hmm, yeah I see what you mean. Unfortunately I can't really think of
> > > > > anything aside from m_total_blocks or perhaps m_phys_blocks at the
> > > > > moment.
> > > > 
> > > > m_phys_blocks seems closer to the intent. If that's acceptible I'll
> > > > change the code to that.
> > > > 
> > > 
> > > That works for me, thanks..
> > > 
> > > BTW, was there ever any kind of solution to the metadata block
> > > reservation issue in the thin case? We now hide metadata reservation
> > > from the user via the m_usable_blocks account. If m_phys_blocks
> > > represents a thin volume, how exactly do we prevent those metadata
> > > allocations/writes from overrunning what the admin has specified as
> > > "usable" with respect to the thin volume?
> > 
> > The reserved metadata blocks are not accounted from free space when
> > they are allocated - they are pulled from the reserved space that
> > has already been removed from the free space.
> > 
> 
> Ok, so the user can set a usable blocks value of something less than the
> fs geometry, then the reservation is pulled from that, reducing the
> reported "usable" value further. Hence, what ends up reported to the
> user is actually something less than the value set by the user, which
> means that the filesystem overall respects how much space the admin says
> it can use in the underlying volume.
> 
> For example, the user creates a 100T thin volume with 10T of usable
> space. The fs reserves a further 2T out of that for metadata, so then
> what the user sees is 8T of writeable space.  The filesystem itself
> cannot use more than 10T out of the volume, as instructed. Am I
> following that correctly? If so, that sounds reasonable to me from the
> "don't overflow my thin volume" perspective.

No, that's not what happens. For thick filesystems, the 100TB volume
gets 2TB pulled from it so it appears as a 98TB filesystem. This is
done by modifying the free block counts and m_usable_space when the
reservations are made.

For thin filesystems, we've already got 90TB of space "reserved",
and so the metadata reservations and allocations come from that.
i.e. we skip the modification of free block counts and m_usable
space in the case of a thinspace filesystem, and so the user still
sees 10TB of usable space that they asked to have.
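
As a sketch with those numbers (the function and field names here are
illustrative only, not what the patchset uses):

/*
 * Usable space reported to userspace once the metadata reservation is
 * taken, e.g. dblocks = 100T, usable_dblocks = 10T, meta_resv = 2T.
 */
unsigned long long usable_after_resv(int is_thin,
                                     unsigned long long dblocks,
                                     unsigned long long usable_dblocks,
                                     unsigned long long meta_resv)
{
        if (is_thin)
                return usable_dblocks;          /* still 10T: resv comes from the hidden 90T */
        return dblocks - meta_resv;             /* thick: 100T - 2T = 98T */
}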

> > i.e. we don't need to rev the interface to support shrink on thin
> > filesystems, so there's no need to rev the interface at this point
> > in time.
> > 
> > *If* we implement physical shrink, *then* we can rev the growfs
> > interface to allow users to run a physical shrink on thin
> > filesystems.
> > 
> 
> Subsequently, pretty much the remainder of my last mail is based on the
> following predicates:
> 
> - We've incorporated this change to use growfs->newblocks for logical
>   shrink.
> - We've implemented physical shrink.
> - We've revved the growfs interface to support physical shrink on thin
>   filesystems.
> 
> I'm not going to repeat all of the previous points... suffice it to say
> that asserting that we only have to rev the interface if/when we support
> physical shrink in response is a circular argument.

No, it's not.

History tells us that every time someone does this sort of "future
planning" for a fixed ABI interface they don't get it right. Hence
when that new functionality comes along the interface has to be
changed /again/ to fix whatever previous assumptions or requirements
were found to be invalid when development of the feature was
actually done.

The syscall and ioctl APIs are littered with examples and we should
pay heed to the lessons those mistakes teach us. i.e. don't
speculate about future API requirements for functionality that may
not exist or may change substantially from what it is envisaged to
need right now.

> I understand that
> and I agree. I'm attempting to review how that would look due to the
> implementation of this feature, particularly with respect to backwards
> compatibility of the existing interface.
> 
> IOW, you're using the argument that we can rev the growfs interface in
> response to the initial argument regarding the inability to
> physically shrink a thin fs.

Yes, because when we get physical shrink it is likely that the
growfs interface is going to need more than just a new set of flags.

Off the top of my head, we're going to need new user-kernel co-ordination
requirements such as read-only AGs while we move data out of them in
a failsafe manner, interleave this with a whole new set of new
"cannot perform operation" cases, handle new partial failure cases,
restarting an interrupted shrink, and maybe even a need to ignore
partial state to complete a previously failed shrink.

And there's even the possibility that we're going to have to limit
physical shrink to rmap enabled filesystems (e.g. how do you find the
owner of that bmbt block that is pinning the AG we need to remove
without rmap?). At which point, a user might want to "force shrink"
and then take the filesystem down and run repair....

IOWs, to say all we are going to need from the growfs ioctl for a
physical shrink interface is a flag to tell thin filesystems to
do a physical shrink is assuming an awful lot about the physical
shrinking implementation and how users will want to interact with
the operation. Until we've done all the design work and are well
into the implementation, we're not going to have a clue about what
we actually need in the way of growfs interface changes for the physical
shrink.

As such, trying to determine a future proof growfs interface change
right now is simply a waste of time because we're not going to get
it right.

> As a result, I'm claiming that revving the
> interface in the future for physical shrink may create more interface
> clumsiness than it's worth compared to just revving it now for logical
> shrink. In response to the points I attempt to make around that, you
> argue above that we aren't any closer to physical shrink than we were 10
> years ago and that we don't have to rev the interface unless we support
> physical shrink. Round and round... ;P
> 
> The best I can read into the response here is that you think physical
> shrink is unlikely enough to not need to care very much what kind of
> interface confusion could result from needing to rev the current growfs
> interface to support physical shrink on thin filesystems in the future.
> Is that a fair assessment..?

Not really. I understand just how complex a physical shrink
implementation is going to be, and have a fair idea of the sorts of
craziness we'll need to add to xfs_growfs to support/co-ordinate a
physical shrink operation.  From that perspective, I don't see a
physical shrink working with an unchanged growfs interface. The
discussion about whether or not we should physically shrink
thinspace filesystems is almost completely irrelevant to the
interface requirements of a physical shrink....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-02 23:30               ` Dave Chinner
@ 2017-11-03  2:47                 ` Darrick J. Wong
  2017-11-03 11:36                   ` Brian Foster
  2017-11-03 11:26                 ` Brian Foster
  1 sibling, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2017-11-03  2:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Fri, Nov 03, 2017 at 10:30:17AM +1100, Dave Chinner wrote:
> On Thu, Nov 02, 2017 at 07:25:33AM -0400, Brian Foster wrote:
> > On Thu, Nov 02, 2017 at 10:53:00AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 01, 2017 at 10:17:21AM -0400, Brian Foster wrote:
> > > > On Wed, Nov 01, 2017 at 11:45:13AM +1100, Dave Chinner wrote:
> > > > > On Tue, Oct 31, 2017 at 07:24:32AM -0400, Brian Foster wrote:
> > > > > > On Tue, Oct 31, 2017 at 08:09:41AM +1100, Dave Chinner wrote:
> > > > > > > On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> > > > > > > > On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> > > > > > > > > This patchset is aimed at filesystems that are installed on sparse
> > > > > > > > > block devices, a.k.a thin provisioned devices. The aim of the
> > > > > > > > > patchset is to bring the space management aspect of the storage
> > > > > > > > > stack up into the filesystem rather than keeping it below the
> > > > > > > > > filesystem where users and the filesystem have no clue they are
> > > > > > > > > about to run out of space.
> > > > > ....
> > > > > > > I get that "total_blocks" sounds better, but to me that's a capacity
> > > > > > > measurement, not an indication of the size of the underlying address
> > > > > > > space the block device has provided. m_usable_blocks is obviously a
> > > > > > > capacity measurement, but I was trying to convey that m_LBA_size is
> > > > > > > not a capacity measurement but an externally imposed addressing
> > > > > > > limit.
> > > > > > > 
> > > > > > > <shrug>
> > > > > > > 
> > > > > > > I guess if I document it well enough m_total_blocks will work.
> > > > > > > 
> > > > > > 
> > > > > > Hmm, yeah I see what you mean. Unfortunately I can't really think of
> > > > > > anything aside from m_total_blocks or perhaps m_phys_blocks at the
> > > > > > moment.
> > > > > 
> > > > > m_phys_blocks seems closer to the intent. If that's acceptible I'll
> > > > > change the code to that.
> > > > > 
> > > > 
> > > > That works for me, thanks..
> > > > 
> > > > BTW, was there ever any kind of solution to the metadata block
> > > > reservation issue in the thin case? We now hide metadata reservation
> > > > from the user via the m_usable_blocks account. If m_phys_blocks
> > > > represents a thin volume, how exactly do we prevent those metadata
> > > > allocations/writes from overrunning what the admin has specified as
> > > > "usable" with respect to the thin volume?
> > > 
> > > The reserved metadata blocks are not accounted from free space when
> > > they are allocated - they are pulled from the reserved space that
> > > has already been removed from the free space.
> > > 
> > 
> > Ok, so the user can set a usable blocks value of something less than the
> > fs geometry, then the reservation is pulled from that, reducing the
> > reported "usable" value further. Hence, what ends up reported to the
> > user is actually something less than the value set by the user, which
> > means that the filesystem overall respects how much space the admin says
> > it can use in the underlying volume.
> > 
> > For example, the user creates a 100T thin volume with 10T of usable
> > space. The fs reserves a further 2T out of that for metadata, so then
> > what the user sees is 8T of writeable space.  The filesystem itself
> > cannot use more than 10T out of the volume, as instructed. Am I
> > following that correctly? If so, that sounds reasonable to me from the
> > "don't overflow my thin volume" perspective.
> 
> No, that's not what happens. For thick filesystems, the 100TB volume
> gets 2TB pulled from it so it appears as a 98TB filesystem. This is
> done by modifying the free block counts and m_usable_space when the
> reservations are made.
> 
> For thin filesystems, we've already got 90TB of space "reserved",
> and so the metadata reservations and allocations come from that.
> i.e. we skip the modification of free block counts and m_usable
> space in the case of a thinspace filesystem, and so the user still
> sees 10TB of usable space that they asked to have.
> 
> > > i.e. we don't need to rev the interface to support shrink on thin
> > > filesystems, so there's no need to rev the interface at this point
> > > in time.
> > > 
> > > *If* we implement physical shrink, *then* we can rev the growfs
> > > interface to allow users to run a physical shrink on thin
> > > filesystems.
> > > 
> > 
> > Subsequently, pretty much the remainder of my last mail is based on the
> > following predicates:
> > 
> > - We've incorporated this change to use growfs->newblocks for logical
> >   shrink.
> > - We've implemented physical shrink.
> > - We've revved the growfs interface to support physical shrink on thin
> >   filesystems.
> > 
> > I'm not going to repeat all of the previous points... suffice it to say
> > that asserting that we only have to rev the interface if/when we support
> > physical shrink in response is a circular argument.
> 
> No, it's not.
> 
> History tells us that every time someone does this sort of "future
> planning" for a fixed ABI interface they don't get it right. Hence
> when that new functionality comes along the interface has to be
> changed /again/ to fix whatever previous assumptions or requirements
> were found to be invalid when development of the feature was
> actually done.
> 
> The syscall and ioctl APIs are littered with examples and we should
> pay heed to the lessons those mistakes teach us. i.e. don't
> speculate about future API requirements for functionality that may
> not exist or may change substantially from what it is envisaged to
> need right now.
> 
> > I understand that
> > and I agree. I'm attempting to review how that would look due to the
> > implementation of this feature, particularly with respect to backwards
> > compatibility of the existing interface.
> > 
> > IOW, you're using the argument that we can rev the growfs interface in
> > response to the initial argument regarding the inability to
> > physically shrink a thin fs.
> 
> Yes, because when we get physical shrink it is likely that the
> growfs interface is going to need more than just a new set of flags.
> 
> Off the top of my head, we're going to need new user-kernel co-ordination
> requirements such as read-only AGs while we move data out of them in
> a failsafe manner, interleave this with a whole new set of new
> "cannot perform operation" cases, handle new partial failure cases,
> restarting an interrupted shrink, and maybe even a need to ignore
> partial state to complete a previously failed shrink.
> 
> And there's even the possibility that we're going to have to limit
> physical shrink to rmap enabled filesystems (e.g. how do you find the
> owner of that bmbt block that is pinning the AG we need to remove
> without rmap?). At which point, a user might want to "force shrink"
> and then take the filesystem down and run repair....
> 
> IOWs, to say all we are going to need from the growfs ioctl for a
> physical shrink interface is a flag to tell thin filesystems to
> do a physical shrink is assuming an awful lot about the physical
> shrinking implementation and how users will want to interact with
> the operation. Until we've done all the design work and are well
> into the implementation, we're not going to have a clue about what
> we actually need in the way of growfs interface changes for the physical
> shrink.
> 
> As such, trying to determine a future proof growfs interface change
> right now is simply a waste of time because we're not going to get
> it right.
> 
> > As a result, I'm claiming that revving the
> > interface in the future for physical shrink may create more interface
> > clumsiness than it's worth compared to just revving it now for logical
> > shrink. In response to the points I attempt to make around that, you
> > argue above that we aren't any closer to physical shrink than we were 10
> > years ago and that we don't have to rev the interface unless we support
> > physical shrink. Round and round... ;P
> > 
> > The best I can read into the response here is that you think physical
> > shrink is unlikely enough to not need to care very much what kind of
> > interface confusion could result from needing to rev the current growfs
> > interface to support physical shrink on thin filesystems in the future.
> > Is that a fair assessment..?
> 
> Not really. I understand just how complex a physical shrink
> implementation is going to be, and have a fair idea of the sorts of
> craziness we'll need to add to xfs_growfs to support/co-ordinate a
> physical shrink operation.  From that perspective, I don't see a
> physical shrink working with an unchanged growfs interface. The
> discussion about whether or not we should physically shrink
> thinspace filesystems is almost completely irrelevant to the
> interface requirements of a physical shrink....

FWIW the way I've been modelling this patch series in my head is that we
format an arbitrarily large filesystem (m_LBA_size) address space on a
thinp, feed statfs an "adjusted" size (m_usable_size)i which restricts
how much space we can allocate, and now growfs increases or decreases
the adjusted size without having to relocate anything or mess with the
address space.  If the adjusted size ever exceeds the address space
size, then we tack on more AGs like we've always done.  From that POV,
there's no need to physically shrink (i.e. relocate) anything (and we
can leave that for later/never).
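
A sketch of that model, where lba_size, usable_size and add_ags are
placeholder names rather than the series' actual interfaces:

struct fs_sizes {
        unsigned long long      lba_size;       /* formatted address space */
        unsigned long long      usable_size;    /* what statfs reports */
};

int add_ags(struct fs_sizes *fs, unsigned long long new_size);

int growfs(struct fs_sizes *fs, unsigned long long new_usable)
{
        if (new_usable > fs->lba_size) {
                /* ran out of formatted address space: physically append AGs */
                int error = add_ags(fs, new_usable);

                if (error)
                        return error;
                fs->lba_size = new_usable;      /* ignoring AG size rounding */
        }
        /* otherwise (and afterwards) it's purely an accounting change */
        fs->usable_size = new_usable;
        return 0;
}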

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-02 23:30               ` Dave Chinner
  2017-11-03  2:47                 ` Darrick J. Wong
@ 2017-11-03 11:26                 ` Brian Foster
  2017-11-03 12:19                   ` Amir Goldstein
  2017-11-05 23:51                   ` Dave Chinner
  1 sibling, 2 replies; 47+ messages in thread
From: Brian Foster @ 2017-11-03 11:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Nov 03, 2017 at 10:30:17AM +1100, Dave Chinner wrote:
> On Thu, Nov 02, 2017 at 07:25:33AM -0400, Brian Foster wrote:
> > On Thu, Nov 02, 2017 at 10:53:00AM +1100, Dave Chinner wrote:
> > > On Wed, Nov 01, 2017 at 10:17:21AM -0400, Brian Foster wrote:
> > > > On Wed, Nov 01, 2017 at 11:45:13AM +1100, Dave Chinner wrote:
> > > > > On Tue, Oct 31, 2017 at 07:24:32AM -0400, Brian Foster wrote:
> > > > > > On Tue, Oct 31, 2017 at 08:09:41AM +1100, Dave Chinner wrote:
> > > > > > > On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> > > > > > > > On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
...
> > > > BTW, was there ever any kind of solution to the metadata block
> > > > reservation issue in the thin case? We now hide metadata reservation
> > > > from the user via the m_usable_blocks account. If m_phys_blocks
> > > > represents a thin volume, how exactly do we prevent those metadata
> > > > allocations/writes from overrunning what the admin has specified as
> > > > "usable" with respect to the thin volume?
> > > 
> > > The reserved metadata blocks are not accounted from free space when
> > > they are allocated - they are pulled from the reserved space that
> > > has already been removed from the free space.
> > > 
> > 
> > Ok, so the user can set a usable blocks value of something less than the
> > fs geometry, then the reservation is pulled from that, reducing the
> > reported "usable" value further. Hence, what ends up reported to the
> > user is actually something less than the value set by the user, which
> > means that the filesystem overall respects how much space the admin says
> > it can use in the underlying volume.
> > 
> > For example, the user creates a 100T thin volume with 10T of usable
> > space. The fs reserves a further 2T out of that for metadata, so then
> > what the user sees is 8T of writeable space.  The filesystem itself
> > cannot use more than 10T out of the volume, as instructed. Am I
> > following that correctly? If so, that sounds reasonable to me from the
> > "don't overflow my thin volume" perspective.
> 
> No, that's not what happens. For thick filesystems, the 100TB volume
> gets 2TB pulled from it so it appears as a 98TB filesystem. This is
> done by modifying the free block counts and m_usable_space when the
> reservations are made.
> 

Ok..

> For thin filesystems, we've already got 90TB of space "reserved",
> and so the metadata reservations and allocations come from that.
> i.e. we skip the modification of free block counts and m_usable
> space in the case of a thinspace filesystem, and so the user still
> sees 10TB of usable space that they asked to have.
> 

Hmm.. so then I'm slightly confused about the thin use case
regarding prevention of pool depletion. The usable blocks value that the
user settles on is likely based on how much space the filesystem should
use to safely avoid pool depletion. If a usable value of 10T means the
filesystem can write to the usable 10T + some amount of metadata
reservation, how does the user determine a sane usable value based on
the current pool geometry?

> > > i.e. we don't need to rev the interface to support shrink on thin
> > > filesystems, so there's no need to rev the interface at this point
> > > in time.
> > > 
> > > *If* we implement physical shrink, *then* we can rev the growfs
> > > interface to allow users to run a physical shrink on thin
> > > filesystems.
> > > 
> > 
> > Subsequently, pretty much the remainder of my last mail is based on the
> > following predicates:
> > 
> > - We've incorporated this change to use growfs->newblocks for logical
> >   shrink.
> > - We've implemented physical shrink.
> > - We've revved the growfs interface to support physical shrink on thin
> >   filesystems.
> > 
> > I'm not going to repeat all of the previous points... suffice it to say
> > that asserting that we only have to rev the interface if/when we support
> > physical shrink in response is a circular argument.
> 
> No, it's not.
> 
> History tells us that every time someone does this sort of "future
> planning" for a fixed ABI interface they don't get it right. Hence
> when that new functionality comes along the interface has to be
> changed /again/ to fix whatever previous assumptions or requirements
> were found to be invalid when development of the feature was
> actually done.
> 
> The syscall and ioctl APIs are littered with examples and we should
> pay heed to the lessons those mistakes teach us. i.e. don't
> speculate about future API requirements for functionality that may
> not exist or may change substantially from what it is envisaged to
> need right now.
> 

Sure, that makes sense...

> > I understand that
> > and I agree. I'm attempting to review how that would look due to the
> > implementation of this feature, particularly with respect to backwards
> > compatibility of the existing interface.
> > 
> > IOW, you're using the argument that we can rev the growfs interface in
> > response to the initial argument regarding the inability to
> > physically shrink a thin fs.
> 
> Yes, because when we get physical shrink it is likely that the
> growfs interface is going to need more than just a new set of flags.
> 
> Off the top of my head, we're going to need new user-kernel co-ordination
> requirements such as read-only AGs while we move data out of them in
> a failsafe manner, interleave this with a whole new set of new
> "cannot perform operation" cases, handle new partial failure cases,
> restarting an interrupted shrink, and maybe even a need to ignore
> partial state to complete a previously failed shrink.
> 
> And there's even the possibility that we're going to have to limit
> physical shrink to rmap enabled filesystems (e.g. how do you find the
> owner of that bmbt block that is pinning the AG we need to remove
> without rmap?). At which point, a user might want to "force shrink"
> and then take the filesystem down and run repair....
> 
> IOWs, to say all we are going to need from the growfs ioctl for a
> physical shrink interface is a flag to tell thin filesystems to
> do a physical shrink is assuming an awful lot about the physical
> shrinking implementation and how users will want to interact with
> the operation. Until we've done all the design work and are well
> into the implementation, we're not going to have a clue about what
> we actually need in the way of growfs interface changes for the physical
> shrink.
> 
> As such, trying to determine a future proof growfs interface change
> right now is simply a waste of time because we're not going to get
> it right.
> 

I'm not sure where the flags proposal thing came from, but regardless...
I'm not trying to future proof/design a shrink interface. Rather, just
reviewing the (use of the) interface as it already exists. I suppose I
am kind of assuming that the current interface would enable some form of
functionality in that hypothetical future world where physical shrink
exists. Perhaps that is not so realistic, however, as you suggest.

I'm not totally convinced of that based on a dump of the challenges
the implementation would have to deal with, but I'm also not spending
much time thinking about the design/implementation of shrink. So I think
that's a reasonable enough argument to allow this feature to creep in on
the interface.

> > As a result, I'm claiming that revving the
> > interface in the future for physical shrink may create more interface
> > clumsiness than it's worth compared to just revving it now for logical
> > shrink. In response to the points I attempt to make around that, you
> > argue above that we aren't any closer to physical shrink than we were 10
> > years ago and that we don't have to rev the interface unless we support
> > physical shrink. Round and round... ;P
> > 
> > The best I can read into the response here is that you think physical
> > shrink is unlikely enough to not need to care very much what kind of
> > interface confusion could result from needing to rev the current growfs
> > interface to support physical shrink on thin filesystems in the future.
> > Is that a fair assessment..?
> 
> Not really. I understand just how complex a physical shrink
> implementation is going to be, and have a fair idea of the sorts of
> craziness we'll need to add to xfs_growfs to support/co-ordinate a
> physical shrink operation.  From that perspective, I don't see a
> physical shrink working with an unchanged growfs interface. The
> discussion about whether or not we should physically shrink
> thinspace filesystems is almost completely irrelevant to the
> interface requirements of a physical shrink....
> 

So it's not so much about the likelihood of realizing physical shrink,
but rather the likelihood that physical shrink would require revving the
growfs structure anyway (regardless of this feature).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-03  2:47                 ` Darrick J. Wong
@ 2017-11-03 11:36                   ` Brian Foster
  2017-11-05 22:50                     ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2017-11-03 11:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Thu, Nov 02, 2017 at 07:47:40PM -0700, Darrick J. Wong wrote:
> On Fri, Nov 03, 2017 at 10:30:17AM +1100, Dave Chinner wrote:
> > On Thu, Nov 02, 2017 at 07:25:33AM -0400, Brian Foster wrote:
> > > On Thu, Nov 02, 2017 at 10:53:00AM +1100, Dave Chinner wrote:
> > > > On Wed, Nov 01, 2017 at 10:17:21AM -0400, Brian Foster wrote:
> > > > > On Wed, Nov 01, 2017 at 11:45:13AM +1100, Dave Chinner wrote:
> > > > > > On Tue, Oct 31, 2017 at 07:24:32AM -0400, Brian Foster wrote:
> > > > > > > On Tue, Oct 31, 2017 at 08:09:41AM +1100, Dave Chinner wrote:
> > > > > > > > On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> > > > > > > > > On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> > > > > > > > > > This patchset is aimed at filesystems that are installed on sparse
> > > > > > > > > > block devices, a.k.a thin provisioned devices. The aim of the
> > > > > > > > > > patchset is to bring the space management aspect of the storage
> > > > > > > > > > stack up into the filesystem rather than keeping it below the
> > > > > > > > > > filesystem where users and the filesystem have no clue they are
> > > > > > > > > > about to run out of space.
> > > > > > ....
> > > > > > > > I get that "total_blocks" sounds better, but to me that's a capacity
> > > > > > > > measurement, not an indication of the size of the underlying address
> > > > > > > > space the block device has provided. m_usable_blocks is obviously a
> > > > > > > > capacity measurement, but I was trying to convey that m_LBA_size is
> > > > > > > > not a capacity measurement but an externally imposed addressing
> > > > > > > > limit.
> > > > > > > > 
> > > > > > > > <shrug>
> > > > > > > > 
> > > > > > > > I guess if I document it well enough m_total_blocks will work.
> > > > > > > > 
> > > > > > > 
> > > > > > > Hmm, yeah I see what you mean. Unfortunately I can't really think of
> > > > > > > anything aside from m_total_blocks or perhaps m_phys_blocks at the
> > > > > > > moment.
> > > > > > 
> > > > > > m_phys_blocks seems closer to the intent. If that's acceptible I'll
> > > > > > change the code to that.
> > > > > > 
> > > > > 
> > > > > That works for me, thanks..
> > > > > 
> > > > > BTW, was there ever any kind of solution to the metadata block
> > > > > reservation issue in the thin case? We now hide metadata reservation
> > > > > from the user via the m_usable_blocks account. If m_phys_blocks
> > > > > represents a thin volume, how exactly do we prevent those metadata
> > > > > allocations/writes from overrunning what the admin has specified as
> > > > > "usable" with respect to the thin volume?
> > > > 
> > > > The reserved metadata blocks are not accounted from free space when
> > > > they are allocated - they are pulled from the reserved space that
> > > > has already been removed from the free space.
> > > > 
> > > 
> > > Ok, so the user can set a usable blocks value of something less than the
> > > fs geometry, then the reservation is pulled from that, reducing the
> > > reported "usable" value further. Hence, what ends up reported to the
> > > user is actually something less than the value set by the user, which
> > > means that the filesystem overall respects how much space the admin says
> > > it can use in the underlying volume.
> > > 
> > > For example, the user creates a 100T thin volume with 10T of usable
> > > space. The fs reserves a further 2T out of that for metadata, so then
> > > what the user sees is 8T of writeable space.  The filesystem itself
> > > cannot use more than 10T out of the volume, as instructed. Am I
> > > following that correctly? If so, that sounds reasonable to me from the
> > > "don't overflow my thin volume" perspective.
> > 
> > No, that's not what happens. For thick filesystems, the 100TB volume
> > gets 2TB pulled from it so it appears as a 98TB filesystem. This is
> > done by modifying the free block counts and m_usable_space when the
> > reservations are made.
> > 
> > For thin filesystems, we've already got 90TB of space "reserved",
> > and so the metadata reservations and allocations come from that.
> > i.e. we skip the modification of free block counts and m_usable
> > space in the case of a thinspace filesystem, and so the user still
> > sees 10TB of usable space that they asked to have.
> > 
> > > > i.e. we don't need to rev the interface to support shrink on thin
> > > > filesystems, so there's no need to rev the interface at this point
> > > > in time.
> > > > 
> > > > *If* we implement physical shrink, *then* we can rev the growfs
> > > > interface to allow users to run a physical shrink on thin
> > > > filesystems.
> > > > 
> > > 
> > > Subsequently, pretty much the remainder of my last mail is based on the
> > > following predicates:
> > > 
> > > - We've incorporated this change to use growfs->newblocks for logical
> > >   shrink.
> > > - We've implemented physical shrink.
> > > - We've revved the growfs interface to support physical shrink on thin
> > >   filesystems.
> > > 
> > > I'm not going to repeat all of the previous points... suffice it to say
> > > that asserting that we only have to rev the interface if/when we support
> > > physical shrink in response is a circular argument.
> > 
> > No, it's not.
> > 
> > History tells us that every time someone does this sort of "future
> > planning" for a fixed ABI interface they don't get it right. Hence
> > when that new functionality comes along the interface has to be
> > changed /again/ to fix whatever previous assumptions or requirements
> > were found to be invalid when development of the feature was
> > actually done.
> > 
> > The syscall and ioctl APIs are littered with examples and we should
> > pay heed to the lessons those mistakes teach us. i.e. don't
> > speculate about future API requirements for functionality that may
> > not exist or may change substantially from what it is envisaged to
> > need right now.
> > 
> > > I understand that
> > > and I agree. I'm attempting to review how that would look due to the
> > > implementation of this feature, particularly with respect to backwards
> > > compatibility of the existing interface.
> > > 
> > > IOW, you're using the argument that we can rev the growfs interface in
> > > response to the initial argument regarding the inability to
> > > physically shrink a thin fs.
> > 
> > Yes, because when we get physical shrink it is likely that the
> > growfs interface is going to need more than just a new set of flags.
> > 
> > Off the top of my head, we're going to need new user-kernel co-ordination
> > requirements such as read-only AGs while we move data out of them in
> > a failsafe manner, interleave this with a whole new set of new
> > "cannot perform operation" cases, handle new partial failure cases,
> > restarting an interrupted shrink, and maybe even a need to ignore
> > partial state to complete a previously failed shrink.
> > 
> > And there's even the possibility that we're going to have to limit
> > physical shrink to rmap enabled filesystems (e.g. how do you find the
> > owner of that bmbt block that is pinning the AG we need to remove
> > without rmap?). At which point, a user might want to "force shrink"
> > and then take the filesystem down and run repair....
> > 
> > IOWs, to say all we are going to need from the growfs ioctl for a
> > physical shrink interface is a flag to tell thin filesystems to
> > do a physical shrink is assuming an awful lot about the physical
> > shrinking implementation and how users will want to interact with
> > the operation. Until we've done all the design work and are well
> > into the implementation, we're not going to have a clue about what
> > we actually need in the way of growfs interface changes for the physical
> > shrink.
> > 
> > As such, trying to determine a future proof growfs interface change
> > right now is simply a waste of time because we're not going to get
> > it right.
> > 
> > > As a result, I'm claiming that revving the
> > > interface in the future for physical shrink may create more interface
> > > clumsiness than it's worth compared to just revving it now for logical
> > > shrink. In response to the points I attempt to make around that, you
> > > argue above that we aren't any closer to physical shrink than we were 10
> > > years ago and that we don't have to rev the interface unless we support
> > > physical shrink. Round and round... ;P
> > > 
> > > The best I can read into the response here is that you think physical
> > > shrink is unlikely enough to not need to care very much what kind of
> > > interface confusion could result from needing to rev the current growfs
> > > interface to support physical shrink on thin filesystems in the future.
> > > Is that a fair assessment..?
> > 
> > Not really. I understand just how complex a physical shrink
> > implementation is going to be, and have a fair idea of the sorts of
> > craziness we'll need to add to xfs_growfs to support/co-ordinate a
> > physical shrink operation.  From that perspective, I don't see a
> > physical shrink working with an unchanged growfs interface. The
> > discussion about whether or not we should physically shrink
> > thinspace filesystems is almost completely irrelevant to the
> > interface requirements of a physical shrink....
> 
> FWIW the way I've been modelling this patch series in my head is that we
> format an arbitrarily large filesystem (m_LBA_size) address space on a
> thinp, feed statfs an "adjusted" size (m_usable_size)i which restricts
> how much space we can allocate, and now growfs increases or decreases
> the adjusted size without having to relocate anything or mess with the
> address space.  If the adjusted size ever exceeds the address space
> size, then we tack on more AGs like we've always done.  From that POV,
> there's no need to physically shrink (i.e. relocate) anything (and we
> can leave that for later/never).
> 

Yeah, that's pretty much my exact perspective on how this feature would
work on the current interface. I don't contend that it wouldn't/couldn't
work with the current interface. It just seems to me that part of the
reason we're using that interface is simply because it's already there
as opposed to it necessarily being the right interface for the feature..

For example, suppose we had an absolute crude, barebones implementation
of physical shrink right now that basically trimmed the amount of space
from the end of the fs iff those AGs were completely empty and otherwise
returned -EBUSY. There is no other userspace support, etc. As such, this
hypothetical feature is extremely limited to being usable immediately
after a growfs and thus probably has no use case other than "undo my
accidental growfs."
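
As a sketch, that crude shrink might be nothing more than something
like this, with every name invented for the example:

#include <errno.h>

struct ag_usage { unsigned long long blocks_used; };

/* Only succeed if every AG past the new end is completely empty. */
int crude_physical_shrink(struct ag_usage *ags, int agcount, int new_agcount)
{
        int agno;

        for (agno = new_agcount; agno < agcount; agno++)
                if (ags[agno].blocks_used)
                        return -EBUSY;          /* nothing gets relocated */
        /* ...drop the trailing AGs and reduce the data block count... */
        return 0;
}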

If we had that right now, _then_ what would the logical shrink interface
look like?

Brian

> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-03 11:26                 ` Brian Foster
@ 2017-11-03 12:19                   ` Amir Goldstein
  2017-11-06  1:16                     ` Dave Chinner
  2017-11-05 23:51                   ` Dave Chinner
  1 sibling, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2017-11-03 12:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Fri, Nov 3, 2017 at 1:26 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Fri, Nov 03, 2017 at 10:30:17AM +1100, Dave Chinner wrote:
>> On Thu, Nov 02, 2017 at 07:25:33AM -0400, Brian Foster wrote:
...
>>
>> As such, trying to determine a future proof growfs interface change
>> right now is simply a waste of time because we're not going to get
>> it right.
>>
>
> I'm not sure where the flags proposal thing came from, but regardless...
> I'm not trying to future proof/design a shrink interface. Rather, just
> reviewing the (use of the) interface as it already exists. I suppose I
> am kind of assuming that the current interface would enable some form of
> functionality in that hypothetical future world where physical shrink
> exists. Perhaps that is not so realistic, however, as you suggest.
>

Guys,

Can we PLEASE stop talking about physical shrink?
I get it, it's unlikely to happen, and I'm sorry I brought up this example
in the first place.

Whether or not we implement physical shrink is not the main issue.
The main issue is implicitly changing the meaning of an existing API.
What can go wrong? I don't know, and I'd rather not give examples,
because then people address the examples and not the conceptual flaw.

This is how I propose to smooth in the new API with as little pain as
possible for existing deployments, yet making sure that "usable block"
is only modified by new programs that intend to modify it:

----------------------
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 8f22fc579dbb..922798ebf3e8 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -171,10 +171,25 @@ xfs_growfs_data_private(
        xfs_rfsblock_t          nfree;
        xfs_agnumber_t          oagcount;
        int                     pct;
+       unsigned                ver;
        xfs_trans_t             *tp;

        nb = in->newblocks;
-       pct = in->imaxpct;
+       pct = in->imaxpct & 0xff;
+       ver = in->imaxpct >> 8;
+       if (ver > 1)
+               return -EINVAL;
+
+       /*
+        * V0 API is only allowed to grow sb_usable_dblocks along with
+        * sb_dblocks.  Any other change to sb_usable_dblocks requires
+        * V1 API to prove that userspace is aware of usable_dblocks.
+        */
+       if (ver == 0 && xfs_sb_version_hasthinspace(mp->m_sb) &&
+           (mp->m_sb.sb_usable_dblocks != mp->m_sb.sb_dblocks ||
+            nb < mp->m_sb.sb_dblocks))
+               return -EINVAL;
+
        if (nb < mp->m_sb.sb_dblocks || pct < 0 || pct > 100)
                return -EINVAL;

----------------

When xfs_growfs WANTS to change the size of an fs known to be thin, it should set
in->imaxpct |= 1 << 8;
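
In userspace that would look something like the sketch below, assuming
the existing XFS_IOC_FSGROWFSDATA ioctl and struct xfs_growfs_data; the
version bit in imaxpct is only the proposal above, not an existing ABI:

#include <sys/ioctl.h>
#include <xfs/xfs.h>    /* XFS_IOC_FSGROWFSDATA, struct xfs_growfs_data */

int thin_set_usable(int fd, unsigned long long new_usable_blocks,
                    unsigned int imaxpct)
{
        struct xfs_growfs_data in = {
                .newblocks = new_usable_blocks,
                .imaxpct   = imaxpct | (1 << 8),        /* V1: aware of usable_dblocks */
        };

        return ioctl(fd, XFS_IOC_FSGROWFSDATA, &in);
}

Old xfs_growfs binaries never set that bit, so under the check above
they can only do the traditional grow-both case and get -EINVAL for
anything that would change usable_dblocks on its own.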

Dave,

Is that the complication of implementation you were talking about?
Really?

Don't you see that this is the right thing to do w.r.t. API design?
If it were you on the other side of review, I bet you would have argued the
same as Brian and myself and used the argument that "The syscall and ioctl
APIs are littered with examples and we should pay heed to the lessons those
mistakes teach us." to oppose implicit API change.

Seriously, my opinion does not carry enough weight in the xfs community
to outweigh your opinion, and if it weren't for Brian who stepped in to
argue in favor of my claims, I would have given up trying to convince you.

Sorry if this is coming off as too harsh of a response.
The sole motivation behind this argument is to prevent pain in the future.
And you are right, we can never predict the future correctly, except for one
thing - that we WILL encounter use cases that none of us can see right now.

Cheers,
Amir

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-03 11:36                   ` Brian Foster
@ 2017-11-05 22:50                     ` Dave Chinner
  2017-11-06 13:01                       ` Brian Foster
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2017-11-05 22:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Fri, Nov 03, 2017 at 07:36:23AM -0400, Brian Foster wrote:
> On Thu, Nov 02, 2017 at 07:47:40PM -0700, Darrick J. Wong wrote:
> > FWIW the way I've been modelling this patch series in my head is that we
> > format an arbitrarily large filesystem (m_LBA_size) address space on a
> > thinp, feed statfs an "adjusted" size (m_usable_size)i which restricts
> > how much space we can allocate, and now growfs increases or decreases
> > the adjusted size without having to relocate anything or mess with the
> > address space.  If the adjusted size ever exceeds the address space
> > size, then we tack on more AGs like we've always done.  From that POV,
> > there's no need to physically shrink (i.e. relocate) anything (and we
> > can leave that for later/never).

[...]

> For example, suppose we had an absolute crude, barebones implementation
> of physical shrink right now that basically trimmmed the amount of space
> from the end of the fs iff those AGs were completely empty and otherwise
> returned -EBUSY. There is no other userspace support, etc. As such, this
> hypothetical feature is extremely limited to being usable immediately
> after a growfs and thus probably has no use case other than "undo my
> accidental growfs."
> 
> If we had that right now, _then_ what would the logical shrink interface
> look like?

Absolutely no different to what I'm proposing we do right now. That
is, the behaviour of the "shrink to size X" ioctl is determined by
the feature bit in the superblock.  Hence if the thinspace feature
is set we do a thin shrink, and if it is not set we do a physical
shrink. i.e. grow/shrink behaviour is defined by the kernel
implementation, not the user or the interface.
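
i.e. conceptually something along these lines, where the struct and
helper names are placeholders rather than anything in the patchset:

struct shrinkable_fs {
        int                     thinspace;      /* sb feature bit */
        unsigned long long      usable_blocks;
};

int physical_grow_or_shrink(struct shrinkable_fs *fs, unsigned long long newblocks);

int set_fs_size(struct shrinkable_fs *fs, unsigned long long newblocks)
{
        if (fs->thinspace) {
                /* thin grow/shrink: nothing but an accounting change */
                fs->usable_blocks = newblocks;
                return 0;
        }
        /* fat filesystem: physically add (or, one day, remove) AGs */
        return physical_grow_or_shrink(fs, newblocks);
}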

As it is, I still can't see a use case or compelling reason for
physically shrinking a thin filesystem. What's the use case that
leads you to think that we need to physically shrink a thin
filesystem, Brian? Let's get that on the table first, rather than
waste time discussing hypothetical what-if's....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-03 11:26                 ` Brian Foster
  2017-11-03 12:19                   ` Amir Goldstein
@ 2017-11-05 23:51                   ` Dave Chinner
  2017-11-06 13:07                     ` Brian Foster
  1 sibling, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2017-11-05 23:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Fri, Nov 03, 2017 at 07:26:27AM -0400, Brian Foster wrote:
> On Fri, Nov 03, 2017 at 10:30:17AM +1100, Dave Chinner wrote:
> > On Thu, Nov 02, 2017 at 07:25:33AM -0400, Brian Foster wrote:
> > > On Thu, Nov 02, 2017 at 10:53:00AM +1100, Dave Chinner wrote:
> > > > On Wed, Nov 01, 2017 at 10:17:21AM -0400, Brian Foster wrote:
> > > > > On Wed, Nov 01, 2017 at 11:45:13AM +1100, Dave Chinner wrote:
> > > > > > On Tue, Oct 31, 2017 at 07:24:32AM -0400, Brian Foster wrote:
> > > > > > > On Tue, Oct 31, 2017 at 08:09:41AM +1100, Dave Chinner wrote:
> > > > > > > > On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> > > > > > > > > On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> ...
> > > > > BTW, was there ever any kind of solution to the metadata block
> > > > > reservation issue in the thin case? We now hide metadata reservation
> > > > > from the user via the m_usable_blocks account. If m_phys_blocks
> > > > > represents a thin volume, how exactly do we prevent those metadata
> > > > > allocations/writes from overrunning what the admin has specified as
> > > > > "usable" with respect to the thin volume?
> > > > 
> > > > The reserved metadata blocks are not accounted from free space when
> > > > they are allocated - they are pulled from the reserved space that
> > > > has already been removed from the free space.
> > > > 
> > > 
> > > Ok, so the user can set a usable blocks value of something less than the
> > > fs geometry, then the reservation is pulled from that, reducing the
> > > reported "usable" value further. Hence, what ends up reported to the
> > > user is actually something less than the value set by the user, which
> > > means that the filesystem overall respects how much space the admin says
> > > it can use in the underlying volume.
> > > 
> > > For example, the user creates a 100T thin volume with 10T of usable
> > > space. The fs reserves a further 2T out of that for metadata, so then
> > > what the user sees is 8T of writeable space.  The filesystem itself
> > > cannot use more than 10T out of the volume, as instructed. Am I
> > > following that correctly? If so, that sounds reasonable to me from the
> > > "don't overflow my thin volume" perspective.
> > 
> > No, that's not what happens. For thick filesystems, the 100TB volume
> > gets 2TB pulled from it so it appears as a 98TB filesystem. This is
> > done by modifying the free block counts and m_usable_space when the
> > reservations are made.
> > 
> 
> Ok..
> 
> > For thin filesystems, we've already got 90TB of space "reserved",
> > and so the metadata reservations and allocations come from that.
> > i.e. we skip the modification of free block counts and m_usable
> > space in the case of a thinspace filesystem, and so the user still
> > sees 10TB of usable space that they asked to have.
> > 
> 
> Hmm.. so then I'm slightly confused regarding the thin use case
> regarding prevention of pool depletion. The usable blocks value that the
> user settles on is likely based on how much space the filesystem should
> use to safely avoid pool depletion.

I did say up front that the user data thinspace accounting would not
be an exact reflection of underlying storage pool usage. Things like
partially written blocks in the underlying storage pool mean write
amplification factors would need to be considered, but that's
something the admin already has to deal with in thinly provisioned
storage.

> If a usable value of 10T means the
> filesystem can write to the usable 10T + some amount of metadata
> reservation, how does the user determine a sane usable value based on
> the current pool geometry?

From an admin POV it's damn easy to document in admin guides that
actual space usage of a thinspace filesystem is going to be in the
order of 2% greater than the space given to the filesystem for user
data. Use an overhead of 2-5% for internal management and the "small
amount of extra space for internal metadata" issue can be ignored.
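
As a rough worked example (numbers illustrative only): to cap a
thinspace filesystem at ~10TB of actual pool consumption, size the
usable space at about 10TB / 1.05 =~ 9.5TB; the 2-5% difference
covers the internal metadata overhead described above.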

> > > The best I can read into the response here is that you think physical
> > > shrink is unlikely enough to not need to care very much what kind of
> > > interface confusion could result from needing to rev the current growfs
> > > interface to support physical shrink on thin filesystems in the future.
> > > Is that a fair assessment..?
> > 
> > Not really. I understand just how complex a physical shrink
> > implementation is going to be, and have a fair idea of the sorts of
> > craziness we'll need to add to xfs_growfs to support/co-ordinate a
> > physical shrink operation.  From that perspective, I don't see a
> > physical shrink working with an unchanged growfs interface. The
> > discussion about whether or not we should physically shrink
> > thinspace filesystems is almost completely irrelevant to the
> > interface requirements of a physical shrink....
> 
> So it's not so much about the likelihood of realizing physical shrink,
> but rather the likelihood that physical shrink would require a rev of
> the growfs structure anyway (regardless of this feature).

Yup, pretty much.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-03 12:19                   ` Amir Goldstein
@ 2017-11-06  1:16                     ` Dave Chinner
  2017-11-06  9:48                       ` Amir Goldstein
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2017-11-06  1:16 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Brian Foster, linux-xfs

On Fri, Nov 03, 2017 at 02:19:15PM +0200, Amir Goldstein wrote:
> Whether or not we implement physical shrink is not the main issue.
> The main issue is implicitly changing the meaning of an existing API.

I don't intend on changing the existing API.

I said "there are bugs in the RFC code, and I'm addressing them".

> This is how I propose to smooth in the new API with as little pain as
> possible for existing deployments, yet making sure that "usable block"
> is only modified by new programs that intend to modify it:
> 
> ----------------------
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 8f22fc579dbb..922798ebf3e8 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -171,10 +171,25 @@ xfs_growfs_data_private(
>         xfs_rfsblock_t          nfree;
>         xfs_agnumber_t          oagcount;
>         int                     pct;
> +       unsigned                ver;
>         xfs_trans_t             *tp;
> 
>         nb = in->newblocks;
> -       pct = in->imaxpct;
> +       pct = in->imaxpct & 0xff;
> +       ver = in->imaxpct >> 8;
> +       if (ver > 1)
> +               return -EINVAL;
> +
> +       /*
> +        * V0 API is only allowed to grow sb_usable_dblocks along with
> +        * sb_dblocks.  Any other change to sb_usable_dblocks requires
> +        * V1 API to prove that userspace is aware of usable_dblocks.
> +        */
> +       if (ver == 0 && xfs_sb_version_hasthinspace(mp->m_sb) &&
> +           (mp->m_sb.sb_usable_dblocks != mp->m_sb.sb_dblocks ||
> +            nb < mp->m_sb.sb_dblocks))
> +               return -EINVAL;
> +
>         if (nb < mp->m_sb.sb_dblocks || pct < 0 || pct > 100)
>                 return -EINVAL;
> 
> ----------------
> 
> When xfs_growfs WANTS to change the size of an fs known to be thin, it should set
> in->imaxpct |= 1 << 8;
> 
> Dave,
> 
> Is that the complication of implementation you were talking about?
> Really?

No, it's not. Seems like everyone is still yelling over me instead
of listening.

Let's start with a demonstration. I'm going to make a thin
filesystem on a kernel running my current patchset, then I'm going
to use an old xfs_growfs binary (from 4.12) to try to grow and
shrink that filesystem.

So, create, mount:

$ sudo ~/packages/mkfs.xfs -f -dthin,size=1g /dev/pmem0
Default configuration sourced from package build definitions
meta-data=/dev/pmem0             isize=512    agcount=8, agsize=262144 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
data     =                       bsize=4096   blocks=2097152, imaxpct=25, thinblocks=262144
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount /dev/pmem0 /mnt/test
$

Let's now use the old growfs to look at the filesystem:

$ sudo xfs_growfs -V
xfs_growfs version 4.12.0
$ sudo xfs_info /mnt/test
meta-data=/dev/pmem0             isize=512    agcount=8, agsize=262144 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=262144, imaxpct=25, thin=0
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$

It reports as a 1GB filesystem, but has 8 AGs of 1GB each. Clearly
a thinspace filesystem, but growfs doesn't know that. The kernel
reports its size as:

$ df -h /mnt/test
Filesystem      Size  Used Avail Use% Mounted on
/dev/pmem0     1006M   33M  974M   4% /mnt/test
$

Now, let's grow it:

$ sudo xfs_growfs -D 500000 /mnt/test
meta-data=/dev/pmem0             isize=512    agcount=8, agsize=262144 blks
[...]
data blocks changed from 262144 to 500000
$

And the kernel reported size:
$ df -h /mnt/test
Filesystem      Size  Used Avail Use% Mounted on
/dev/pmem0      1.9G   33M  1.9G   2% /mnt/test
$

And xfs_info for good measure:

$ sudo xfs_info /mnt/test
meta-data=/dev/pmem0             isize=512    agcount=8, agsize=262144 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=500000, imaxpct=25, thin=0
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$

The thinspace grow succeeded exactly as it should - it grew to
500k blocks without changing the physical filesystem. Thinspace is
completely transparent to the grow operation now, and will work with
old growfs binaries and apps just fine.

So, onto shrink with an old growfs binary:

$ sudo xfs_growfs -D 400000 /mnt/test
meta-data=/dev/pmem0             isize=512    agcount=8, agsize=262144 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=500000, imaxpct=25, thin=0
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data size 400000 too small, old size is 500000
$

Oh, look, the old growfs binary refused to shrink the filesystem.
It didn't even call into the filesystem, it just assumes that you
can't shrink XFS filesystems and doesn't even try.

So, this means old userspace and new thinspace filesystems are
pretty safe without having to change the interface. And it tells us
that the main change the userspace growfs code needs is to allow
shrink to proceed. Hence, the new XFS_IOC_FSGEOMETRY ioctl and this
change:

-               if (!error && dsize < geo.datablocks) {
+               if (dsize < geo.datablocks && !thin_enabled) {
                        fprintf(stderr, _("data size %lld too small,"
                                " old size is %lld\n"),
                                (long long)dsize, (long long)geo.datablocks);
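
For reference, thin_enabled above would be derived from the new
geometry output, something along these lines (the exact field/flag
name here is a guess, not necessarily what the patches use):

	thin_enabled = (geo.flags & XFS_FSOP_GEOM_FLAGS_THINSPACE) != 0;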

So, now the growfs binary can only shrink filesystems that report
as thinspace filesystems. It will still refuse to shrink filesystems
that cannot be shrunk. With a new binary:

$ sudo xfs_growfs -V
xfs_growfs version 4.13.1
$ sudo xfs_growfs -D 400000 /mnt/test
meta-data=/dev/pmem0             isize=512    agcount=8, agsize=262144 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=500000, imaxpct=25, thin=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 500000 to 400000
$ df -h /mnt/test
Filesystem      Size  Used Avail Use% Mounted on
/dev/pmem0      1.6G   33M  1.5G   3% /mnt/test
$

We can see it shrinks a thinspace filesystem just fine. A normal
thick filesystem:

$ sudo ~/packages/mkfs.xfs /dev/pmem1
Default configuration sourced from package build definitions
meta-data=/dev/pmem1             isize=512    agcount=4, agsize=524288 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
data     =                       bsize=4096   blocks=2097152, imaxpct=25, thinblocks=0
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo mount /dev/pmem1 /mnt/scratch
$ sudo xfs_growfs -D 400000 /mnt/scratch
[....]
data size 400000 too small, old size is 2097152
$

Fails in exactly the same way as it did before.

IOWs, the only thing that growfs required was awareness that it
could shrink a filesystem (an XFS_IOC_FSGEOMETRY change) and then
without any other changes it can operate sanely on thinspace
filesystems. There is no need to change the actual
XFS_IOC_FSGROWDATA ioctl, because it doesn't care what the
underlying implementation does or supports....

----

That's a demonstration that the growfs interface does not need to
change for userspace to retain sane, predictable, obvious behaviour
across both old and new growfs binaries with the introduction of
thinspace filesystems. People calling the growfs ioctl directly
without adding thinspace awareness will just see some filesystems
succeed when shrinking - they'll need to add checks for the
thinspace flag in the XFS_IOC_FSGEOMETRY results if they want to do
anything smarter, but otherwise thinspace grow and shrink will just
/work as expected/ with existing binaries.

Amir, my previous comments about unnecessary interface complication
are based on these observations. You seem to be making an underlying
assumption that existing binaries can't grow/shrink thinspace
filesystems without modification. This assumption is incorrect.
Indeed, code complexity isn't the issue here - the issue is the
smearing of implementation details into an abstract control command
that is completely independent of the underlying implementations.

Indeed, userspace does not need to be aware of whether the
filesystem is a thinspace filesystem or not - it just needs to know
the current size so that if it doesn't need to change the size (e.g.
imaxpct change) it puts the right value into the growfs control
structure.
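
A minimal sketch of that pattern from userspace (error handling
trimmed; note the ioctl is spelled XFS_IOC_FSGROWFSDATA in xfs_fs.h,
and the header path below may vary):

	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <xfs/xfs.h>	/* xfs_fsop_geom_t, xfs_growfs_data_t */

	/* change imaxpct only, leaving the filesystem size untouched */
	int set_imaxpct(const char *mntpt, __u32 new_imaxpct)
	{
		xfs_fsop_geom_t		geo;
		xfs_growfs_data_t	in;
		int			fd, error;

		fd = open(mntpt, O_RDONLY);
		if (fd < 0)
			return -1;
		error = ioctl(fd, XFS_IOC_FSGEOMETRY, &geo);
		if (!error) {
			/* echo back the current size == "do not resize" */
			in.newblocks = geo.datablocks;
			in.imaxpct = new_imaxpct;
			error = ioctl(fd, XFS_IOC_FSGROWFSDATA, &in);
		}
		close(fd);
		return error;
	}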

That is the bit I got wrong in the RFC code I posted - the
XFS_IOC_FSGEOMETRY ioctl changes put the wrong size in
geo->datablocks for thinspace filesystems. I've said this multiple
times now, but the message doesn't seem to have sunk in - if we put
the right data in XFS_IOC_FSGEOMETRY, then userspace doesn't need to
care about whether it's a thin or a thick filesystem that it is
operating on.

However, if we start changing the growfs ioctl because it
[apparently] needs to be aware of thin/thick filesystems, we
introduce kernel/userspace compatibility issues that currently don't
exist. And, as I've clearly demonstrated, those interface
differences *do not need to exist* for both old and new growfs
binaries to work correctly with thinspace filesystems.

That's the "complication of implementation" I was refering to -
introducing interface changes that would result in extra changes to
userspace binaries and as such introduce user visible binary
compatibility issues that users do not need to (and should not) be
exposed to. Not to mention other application developers that might
be using the existing geometry and growfs ioctls - shrink will now
just work on existing binaries without them having to do
anything....

> Don't you see that this is the right thing to do w.r.t. API design?

No, I don't, because you're trying to solve a problem that, quite
simply, doesn't exist.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-06  1:16                     ` Dave Chinner
@ 2017-11-06  9:48                       ` Amir Goldstein
  2017-11-06 21:46                         ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Amir Goldstein @ 2017-11-06  9:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Mon, Nov 6, 2017 at 3:16 AM, Dave Chinner <david@fromorbit.com> wrote:
...

> Not to mention other application developers that might
> be using the existing geometry and growfs ioctls - shrink will now

Acknowledging that those "other applications" may exist in the wild
makes it even harder to claim that allowing usable_dblocks to be
changed with the existing API is not going to cause pain for users...

> just work on existing binaries without them having to do
> anything....
>
>> Don't you see that this is the right thing to do w.r.t. API design?
>
> No, I don't, because you're trying to solve a problem that, quite
> simply, doesn't exist.
>

It is *very* possible that you are right, but you have not proven that
the problem does not exist. You have proven that the problem
does not exist w.r.t. old xfs_growfs -D <size>, and you correctly
claimed that the problem with old xfs_growfs -m <imaxpct> is an
implementation bug in the RFC patches.

Let me give an example that will demonstrate my concern.

One of our older NAS products, still deployed with many customers,
has an LVM-based volume manager and an ext3 filesystem.
When a user changes the size of a volume via the Web UI, lower-level
commands will resize LVM and then resize2fs to max size.
Because "resize2fs to max size" is not an atomic operation and
because this is a "for dummies" product, in order to recover from
"half-resize", there is a post-mount script that runs resize2fs
unconditionally after boot.

So in this product, the LVM volume size is treated as an "intent log"
for the filesystem grow operation.

I find it hard to believe that this practice is so novel that nobody
else has ever used it - for that matter, with an xfs filesystem over
LVM and xfs_growfs -d.

Now imagine you upgrade such a system to a kernel that supports
"thinspace" and new xfsprogs and create thin file systems, and then
downgrade the system to a kernel that still supports "thinspace", but
xfsprogs that do not (or even a proprietary system component that
uses XFS_IOC_FSGROWDATA ioctl to perform the "auto-grow").

The result will be that all the thin filesystems will "auto-grow"
to the thick size of the volume.

So the way I see it, my proposal to explicitly require the
XFS_IOC_FSGROWDATA API V1 for any change to usable_dblocks
that is not coupled with the same change to dblocks is meant to resolve
userspace/kernel compatibility issues.

And I fail to see how that requirement makes it hard to maintain
userspace/kernel compatibility:
- xfs_growfs needs to check for the "thinspace" flag and, if present, use the V1 API
- an old kernel can't mount a "thinspace" fs, so it can never see the V1 API
  except from a buggy program, which will get -EINVAL
- old xfs_growfs will keep failing to shrink even a thin fs
- old xfs_growfs will succeed in growing, except (*) for a thin fs that
  was previously shrunk

(*) That exception relates to the example I described above,
and we do not seem to be in agreement about the desired behavior.

IIUC, you like the fact that old xfs_growfs can grow a thin, previously
shrunk fs, where I see trouble lurking in this behavior.

So we can agree to disagree on the desired behavior, but for the
record, this and only this point is the API design flaw I am talking
about.

There may be complexities w.r.t. maintaining userspace/kernel compatibility
with the proposed solution. I trust you on this because you have far
more experience than I do with maintaining the historical baggage of
wrongly designed APIs.

If no one else is concerned about the old xfs_growfs -d use case and no one
else shares my opinion about the desired behavior in that use case, then
I withdraw my claims.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-05 22:50                     ` Dave Chinner
@ 2017-11-06 13:01                       ` Brian Foster
  2017-11-06 21:20                         ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2017-11-06 13:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Mon, Nov 06, 2017 at 09:50:28AM +1100, Dave Chinner wrote:
> On Fri, Nov 03, 2017 at 07:36:23AM -0400, Brian Foster wrote:
> > On Thu, Nov 02, 2017 at 07:47:40PM -0700, Darrick J. Wong wrote:
> > > FWIW the way I've been modelling this patch series in my head is that we
> > > format an arbitrarily large filesystem (m_LBA_size) address space on a
> > > thinp, feed statfs an "adjusted" size (m_usable_size) which restricts
> > > how much space we can allocate, and now growfs increases or decreases
> > > the adjusted size without having to relocate anything or mess with the
> > > address space.  If the adjusted size ever exceeds the address space
> > > size, then we tack on more AGs like we've always done.  From that POV,
> > > there's no need to physically shrink (i.e. relocate) anything (and we
> > > can leave that for later/never).
> 
> [...]
> 
> > For example, suppose we had an absolutely crude, barebones implementation
> > of physical shrink right now that basically trimmed the amount of space
> > from the end of the fs iff those AGs were completely empty and otherwise
> > returned -EBUSY. There is no other userspace support, etc. As such, this
> > hypothetical feature is extremely limited to being usable immediately
> > after a growfs and thus probably has no use case other than "undo my
> > accidental growfs."
> > 
> > If we had that right now, _then_ what would the logical shrink interface
> > look like?
> 
> Absolutely no different to what I'm proposing we do right now. That
> is, the behaviour of the "shrink to size X" ioctl is determined by
> the feature bit in the superblock.  Hence if the thinspace feature
> is set we do a thin shrink, and if it is not set we do a physical
> shrink. i.e. grow/shrink behaviour is defined by the kernel
> implementation, not the user or the interface.
> 

I don't buy that argument at all. ;) What you describe above may be
reasonable for the current situation where shrink doesn't actually exist
(or thin comes first), but the above example assumes that there is at
least one simple and working physical shrink use case wired up to the
existing interface already. What you suggest means we would change the
behavior of a _working_ interface to do something completely different
based on a feature bit in the filesystem.

> As it is, I still can't see a use case or compelling reason for
> physically shrinking a thin filesystem. What's the use case that
> leads you to think that we need to physically shrink a thin
> filesystem, Brian? Let's get that on the table first, rather than
> waste time discussing hypothetical what-if's....
> 

I think you're missing the point. I'm really not that concerned about
whether we ultimately allow physical shrink or not on thin filesystems.
We can make that decision down the road. As mentioned previously, the
decision to support physical shrink on a thin enabled filesystem is
distinct from preserving the ability to do so via the current interface.

My concern is that the decision to override this interface has the
potential to create a mess later, both with the kernel interface and
from the perspective of xfsprogs usability. I don't want to try and
repeat the weird corner cases where I think that could materialize
because I can't seem to get that across well enough for you to consider
the requisite conditions. Instead, perhaps I'll just try to describe
what I think this interface should look like at a high level...

- Define a new version of struct xfs_growfs_data that includes a
  new_blocks field, a new_usable_blocks field and imaxpct. Also include
  whatever mechanical changes are necessary to rev the interface (i.e.,
  version number, padding, etc.) - a rough sketch follows below.

  This means that the updated growfs interface supports the ability to
  physically and logically grow/shrink independent from whatever feature
  decisions we make in the future. This also means the logical
  grow/shrink interface is a bit more flexible because we don't have to
  enforce logical grow on physical grow. Finally, if we ever do support
  physical and logical shrink together, the potential for having to
  consider whether a growfs_data->newblocks shrink command means logical
  shrink because it comes from an older (but post-thin) xfsprogs or
  physical shrink because it comes from some 3rd party application goes
  away completely. The kernel interface is clear and well-defined.
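
  Something like the following is the shape I have in mind (the field
  names and padding are purely illustrative, not a concrete layout
  proposal):

	struct xfs_growfs_data_v2 {
		__u32	version;		/* structure version */
		__u32	imaxpct;		/* as today */
		__u64	new_blocks;		/* physical size (address space) */
		__u64	new_usable_blocks;	/* logical/usable size (thin limit) */
		__u64	reserved[2];		/* future expansion */
	};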

For userspace, we have at least a couple options:

- Implement the same behavior as you've already proposed: physically and
  logically grow together, logical shrink only when hasthin == true,
  otherwise return an error. The point here is that using a more
  flexible kernel interface doesn't preclude/enforce how we decide to
  expose this feature in xfsprogs, and afaict doesn't introduce any new
  backwards compatibility issues since we have to update xfsprogs
  anyways. Older xfsprogs would retain the same behavior it has today.

Or...

- Create a separate logical grow/shrink parameter in xfs_growfs (i.e., a
  -T param analogous to -D for physical blocks). This ensures that
  logical shrink/grow can execute independent from physical shrink/grow,
  that there is no potential for confusion over logical vs. physical
  shrink on thin filesystems and in particular, that if we do ever
  support physical shrink but do _not_ support it on thin fs', that
  there is a distinct physical shrink command that _will return an error
  from the kernel_ even after userspace grows support for physical
  shrink, rather than succeed doing something other than what the user
  might have anticipated.

  Note that even with a separate thin parameter, I think we could still
  consider support of logical+physical grow via the xfs_growfs -d
  parameter.
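
  For illustration only - hypothetical syntax, not an existing
  xfs_growfs option:

	$ xfs_growfs -T 400000 /mnt/test	# logical (thin) resize only
	$ xfs_growfs -D 500000 /mnt/test	# physical resize, as today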

I find the first option unwise because it similarly confuses the userspace
syntax of a future physical shrink. For example, it makes me start
to think about things like whether some user could come along after we
support physical shrink (particularly if we don't support it on thin
fs), run the "shrink the data section" command mistakenly thinking it
freed up block address space, and then run the corresponding lvresize
command to shrink the volume since the fs shrink succeeded (which leaves
the user in a data loss scenario).

That is primarily due to user error of course, but maybe the user in
that example simply picked the wrong volume out of tens or hundreds of
others and didn't realize it was thin in the first place. I'm sure we'll
have documentation and whatnot to absolve us from blame, but IMO this is
all much more usable with distinct interfaces for physical vs. logical
adjustments.

In summary, my arguments here consist mostly of a collection of red
flags that I see rather than hard incompatibilities or specific use
cases I want to support. The problematic situations change depending on
whether we decide to support physical shrink on thin fs or not and so
it's not really possible or important to try and pin them all down.
OTOH, it's also quite possible that none of them ever materialize at
all.

If they do, I'm pretty sure we could find ways to address each one
individually as we progress, or document potentially confusing behavior
appropriately, etc. The larger point is that I think much of this simply
goes away with a cleaner interface. IMO, this boils down to what I think
is just a matter of practicing good software engineering and system/user
interface design.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-05 23:51                   ` Dave Chinner
@ 2017-11-06 13:07                     ` Brian Foster
  0 siblings, 0 replies; 47+ messages in thread
From: Brian Foster @ 2017-11-06 13:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Nov 06, 2017 at 10:51:04AM +1100, Dave Chinner wrote:
> On Fri, Nov 03, 2017 at 07:26:27AM -0400, Brian Foster wrote:
> > On Fri, Nov 03, 2017 at 10:30:17AM +1100, Dave Chinner wrote:
> > > On Thu, Nov 02, 2017 at 07:25:33AM -0400, Brian Foster wrote:
> > > > On Thu, Nov 02, 2017 at 10:53:00AM +1100, Dave Chinner wrote:
> > > > > On Wed, Nov 01, 2017 at 10:17:21AM -0400, Brian Foster wrote:
> > > > > > On Wed, Nov 01, 2017 at 11:45:13AM +1100, Dave Chinner wrote:
> > > > > > > On Tue, Oct 31, 2017 at 07:24:32AM -0400, Brian Foster wrote:
> > > > > > > > On Tue, Oct 31, 2017 at 08:09:41AM +1100, Dave Chinner wrote:
> > > > > > > > > On Mon, Oct 30, 2017 at 09:31:17AM -0400, Brian Foster wrote:
> > > > > > > > > > On Thu, Oct 26, 2017 at 07:33:08PM +1100, Dave Chinner wrote:
> > ...
> > > > > > BTW, was there ever any kind of solution to the metadata block
> > > > > > reservation issue in the thin case? We now hide metadata reservation
> > > > > > from the user via the m_usable_blocks account. If m_phys_blocks
> > > > > > represents a thin volume, how exactly do we prevent those metadata
> > > > > > allocations/writes from overrunning what the admin has specified as
> > > > > > "usable" with respect to the thin volume?
> > > > > 
> > > > > The reserved metadata blocks are not accounted from free space when
> > > > > they are allocated - they are pulled from the reserved space that
> > > > > has already been removed from the free space.
> > > > > 
> > > > 
> > > > Ok, so the user can set a usable blocks value of something less than the
> > > > fs geometry, then the reservation is pulled from that, reducing the
> > > > reported "usable" value further. Hence, what ends up reported to the
> > > > user is actually something less than the value set by the user, which
> > > > means that the filesystem overall respects how much space the admin says
> > > > it can use in the underlying volume.
> > > > 
> > > > For example, the user creates a 100T thin volume with 10T of usable
> > > > space. The fs reserves a further 2T out of that for metadata, so then
> > > > what the user sees is 8T of writeable space.  The filesystem itself
> > > > cannot use more than 10T out of the volume, as instructed. Am I
> > > > following that correctly? If so, that sounds reasonable to me from the
> > > > "don't overflow my thin volume" perspective.
> > > 
> > > No, that's not what happens. For thick filesystems, the 100TB volume
> > > gets 2TB pulled from it so it appears as a 98TB filesystem. This is
> > > done by modifying the free block counts and m_usable_space when the
> > > reservations are made.
> > > 
> > 
> > Ok..
> > 
> > > For thin filesystems, we've already got 90TB of space "reserved",
> > > and so the metadata reservations and allocations come from that.
> > > i.e. we skip the modification of free block counts and m_usable
> > > space in the case of a thinspace filesystem, and so the user still
> > > sees 10TB of usable space that they asked to have.
> > > 
> > 
> > Hmm.. so then I'm slightly confused regarding the thin use case
> > regarding prevention of pool depletion. The usable blocks value that the
> > user settles on is likely based on how much space the filesystem should
> > use to safely avoid pool depletion.
> 
> I did say up front that the user data thinspace accounting would not
> be an exact reflection of underlying storage pool usage. Things like
> partially written blocks in the underlying storage pool mean write
> amplification factors would need to be considered, but that's
> something the admin already has to deal with in thinly provisioned
> storage.
> 

Ok, I recall this coming up one way or another. For some reason I
thought something might have changed in the implementation since then
and/or managed to confuse myself over the current behavior.

> > If a usable value of 10T means the
> > filesystem can write to the usable 10T + some amount of metadata
> > reservation, how does the user determine a sane usable value based on
> > the current pool geometry?
> 
> From an admin POV it's damn easy to document in admin guides that
> actual space usage of a thinspace filesystem is going to be in the
> order of 2% greater than the space given to the filesystem for user
> data. Use an overhead of 2-5% for internal management and the "small
> amount of extra space for internal metadata" issue can be ignored.
> 

It's easy to document whatever we want. :) I'm not convinced that is as
effective as a hard limit based on the fs features, but the latter is
more complex and may be overkill in most cases. So, documentation works
for me until/unless testing or real usage shows otherwise.

If it does come up, perhaps a script or userspace tool that somehow
presents the current internal reservation calculations (combined with
whatever geometry information is relevant) as something consumable for
the user (whether it be a simple dump of the active reservations, the
worst case consumption of a thin fs, etc.) might be a nice compromise.

> > > > The best I can read into the response here is that you think physical
> > > > shrink is unlikely enough to not need to care very much what kind of
> > > > interface confusion could result from needing to rev the current growfs
> > > > interface to support physical shrink on thin filesystems in the future.
> > > > Is that a fair assessment..?
> > > 
> > > Not really. I understand just how complex a physical shrink
> > > implementation is going to be, and have a fair idea of the sorts of
> > > craziness we'll need to add to xfs_growfs to support/co-ordinate a
> > > physical shrink operation.  From that perspective, I don't see a
> > > physical shrink working with an unchanged growfs interface. The
> > > discussion about whether or not we should physically shrink
> > > thinspace filesystems is almost completely irrelevant to the
> > > interface requirements of a physical shrink....
> > 
> > So it's not so much about the likelihood of realizing physical shrink,
> > but rather the likelihood that physical shrink would require a rev of
> > the growfs structure anyway (regardless of this feature).
> 
> Yup, pretty much.
> 

Ok. I don't agree, but at least I understand your perspective. ;)

Brian

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-06 13:01                       ` Brian Foster
@ 2017-11-06 21:20                         ` Dave Chinner
  2017-11-07 11:28                           ` Brian Foster
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2017-11-06 21:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Mon, Nov 06, 2017 at 08:01:00AM -0500, Brian Foster wrote:
> On Mon, Nov 06, 2017 at 09:50:28AM +1100, Dave Chinner wrote:
> > On Fri, Nov 03, 2017 at 07:36:23AM -0400, Brian Foster wrote:
> > > On Thu, Nov 02, 2017 at 07:47:40PM -0700, Darrick J. Wong wrote:
> > > > FWIW the way I've been modelling this patch series in my head is that we
> > > > format an arbitrarily large filesystem (m_LBA_size) address space on a
> > > > thinp, feed statfs an "adjusted" size (m_usable_size) which restricts
> > > > how much space we can allocate, and now growfs increases or decreases
> > > > the adjusted size without having to relocate anything or mess with the
> > > > address space.  If the adjusted size ever exceeds the address space
> > > > size, then we tack on more AGs like we've always done.  From that POV,
> > > > there's no need to physically shrink (i.e. relocate) anything (and we
> > > > can leave that for later/never).
> > 
> > [...]
> > 
> > > For example, suppose we had an absolutely crude, barebones implementation
> > > of physical shrink right now that basically trimmed the amount of space
> > > from the end of the fs iff those AGs were completely empty and otherwise
> > > returned -EBUSY. There is no other userspace support, etc. As such, this
> > > hypothetical feature is extremely limited to being usable immediately
> > > after a growfs and thus probably has no use case other than "undo my
> > > accidental growfs."
> > > 
> > > If we had that right now, _then_ what would the logical shrink interface
> > > look like?
> > 
> > Absolutely no different to what I'm proposing we do right now. That
> > is, the behaviour of the "shrink to size X" ioctl is determined by
> > the feature bit in the superblock.  Hence if the thinspace feature
> > is set we do a thin shrink, and if it is not set we do a physical
> > shrink. i.e. grow/shrink behaviour is defined by the kernel
> > implementation, not the user or the interface.
> > 
> 
> I don't buy that argument at all. ;) What you describe above may be
> reasonable for the current situation where shrink doesn't actually exist
> (or thin comes first),

Which is the case we are discussing here. thinspace shrink is here,
now, physical shrink is no closer than it was 10 years ago. So it's
reasonable to design changes around the needs of thinspace shrink
because physical shrink is still years away (if ever).

> but the above example assumes that there is at
> least one simple and working physical shrink use case wired up to the
> existing interface already.

IOWs, this is a strawman argument that involves designing an API to
suit the strawman.

[....]

> In summary, my arguments here consist mostly of a collection of red
> flags that I see rather than hard incompatibilities or specific use
> cases I want to support. The problematic situations change depending on
> whether we decide to support physical shrink on thin fs or not and so
> it's not really possible or important to try and pin them all down.
> OTOH, it's also quite possible that none of them ever materialize at
> all.

And that's the point I keep making: we don't know which of the
strawmen being presented are going to matter (if at all) until we
have physical shrink designed and are deep into the implementation.

IOWs, trying to work out the future API needs of a physical shrink
is just a guessing game right now.

> If they do, I'm pretty sure we could find ways to address each one
> individually as we progress, or document potentially confusing behavior
> appropriately, etc. The larger point is that I think much of this simply
> goes away with a cleaner interface. IMO, this boils down to what I think
> is just a matter of practicing good software engineering and system/user
> interface design.

Yes, but designing based on a /guess/ is *bad engineering practice*.
It almost always ends up wrong and has to be reworked, and that
means we get stuck supporting an API we don't need or want forever
more.

Yes, we've categorised the risk that we might need an interface
change in future - as we should - but we don't know which of those
risks are going to materialise.  IOWs, we can't solve this interface
problem with the information or insight we currently have - we need
to implement physical shrink and determine which of these risks
actually materialise, and then we can address the interface issue
knowing that we're solving the problems that physical shrink
introduces.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-06  9:48                       ` Amir Goldstein
@ 2017-11-06 21:46                         ` Dave Chinner
  2017-11-07  5:30                           ` Amir Goldstein
  0 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2017-11-06 21:46 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Brian Foster, linux-xfs

On Mon, Nov 06, 2017 at 11:48:05AM +0200, Amir Goldstein wrote:
> On Mon, Nov 6, 2017 at 3:16 AM, Dave Chinner <david@fromorbit.com> wrote:
> > just work on existing binaries without them having to do
> > anything....
> >
> >> Don't you see that this is the right thing to do w.r.t. API design?
> >
> > No, I don't, because you're trying to solve a problem that, quite
> > simply, doesn't exist.
> >
> 
> It is *very* possible that you are right, but you have not proven that
> the problem does not exist. You have proven that the problem
> does not exist w.r.t. old xfs_growfs -D <size>, and you correctly
> claimed that the problem with old xfs_growfs -m <imaxpct> is an
> implementation bug in the RFC patches.
> 
> Let me give an example that will demonstrate my concern.
> 
> One of our older NAS products, still deployed with many customers,
> has an LVM-based volume manager and an ext3 filesystem.
> When a user changes the size of a volume via the Web UI, lower-level
> commands will resize LVM and then resize2fs to max size.
> Because "resize2fs to max size" is not an atomic operation and
> because this is a "for dummies" product, in order to recover from
> "half-resize", there is a post-mount script that runs resize2fs
> unconditionally after boot.

Sure, but if you have a product using thinspace filesystems then you
are going to need to do something different. To think you can just
plug a thinspace filesystem into an existing stack and have it work
unmodified is naive at best.

Indeed, with a thinspace filesystem, the LVM volume *never needs
resizing* because it will be created far larger than ever required
in the first place. And recovering from a crash half way through a
thinspace growfs operation? Well, it's all done through the
filesystem journal because it's all done through transactions.

IOWs, thinspace filesystems solve this "atomic resize" problem
without needing crutches or external constructs to hide non-atomic
operations.

Essentially, if you have a storage product and you plug thinspace
filesystems into it without management modifications, then you get
to keep all the bits that break and don't work correctly.

All I care about is that stupid things don't happen if someone has
integrated thinspace filesystems correctly and then a user
does this:

> Now imagine you upgrade such a system to a kernel that supports
> "thinspace" and new xfsprogs and create thin file systems, and then
> downgrade the system to a kernel that still supports "thinspace", but
> xfsprogs that do not (or even a proprietary system component that
> uses XFS_IOC_FSGROWDATA ioctl to perform the "auto-grow").

In this case the thinspace filesystems will behave exactly like a
physical filesystem. i.e. they will physically grow to the size of
the underlying device. I can't stop this from happening, but I can
ensure that it doesn't do irreversible damage and that it's
reversible as soon as userspace is restored to support thinspace
administration again. i.e. just shrink it back down to the required
thin size, and it's like that grow never occurred...

i.e. it's not the end of the world, and you can recover cleanly from
it without any issues.

> The result will be that all the thin filesystems will "auto-grow"
> to the thick size of the volume.

Of course it will - the user/app/admin asked the kernel to grow the
filesystem to the size of the underlying device.  I don't know what
you expect a thinspace filesystem to do here other than *grow the
filesystem to the size of the underlying device*.

> So we can agree to disagree on the desired behavior, but for the
> record, this and only this point is the API design flaw I am talking
> about.

What flaw?

You've come up with a scenario where the thinspace filesystem will
behave like a physical filesystem, because that's what the
thinspace-unaware admin program expects it to be. i.e. thinspace is completely
transparent to userspace, and physical grow does not prevent
subsequent shrinks back to the original thinspace size.

None of this indicates that there is an interface problem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-06 21:46                         ` Dave Chinner
@ 2017-11-07  5:30                           ` Amir Goldstein
  0 siblings, 0 replies; 47+ messages in thread
From: Amir Goldstein @ 2017-11-07  5:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Mon, Nov 6, 2017 at 11:46 PM, Dave Chinner <david@fromorbit.com> wrote:
[...]
>> commands will resize LVM and then resize2fs to max size.
>> Because "resize2fs to max size" is not an atomic operation and
>> because this is a "for dummies" product, in order to recover from
>> "half-resize", there is a post-mount script that runs resize2fs
>> unconditionally after boot.
>
> Sure, but if you have a product using thinspace filesystems then you
> are going to need to do something different. To think you can just
> plug a thinspace filesystem into an existing stack and have it work
> unmodified is naive at best.
>

I do not expect it to "work" - on the contrary, I expect it not to work,
just the same as xfs_repair will refuse to repair an xfs filesystem with
unknown feature flags.

[...]
>
>> Now imagine you upgrade such a system to a kernel that supports
>> "thinspace" and new xfsprogs and create thin file systems, and then
>> downgrade the system to a kernel that still supports "thinspace", but
>> xfsprogs that do not (or even a proprietary system component that
>> uses XFS_IOC_FSGROWDATA ioctl to perform the "auto-grow").
>
> In this case the thinspace filesystems will behave exactly like a
> physical filesystem. i.e. they will physically grow to the size of
> the underlying device. I can't stop this from happening, but I can
> ensure that it doesn't do irreversible damage and that it's
> reversible as soon as userspace is restored to support thinspace
> administration again. i.e. just shrink it back down to the required
> thin size, and it's like that grow never occurred...
>
> i.e. it's not the end of the world, and you can recover cleanly from
> it without any issues.
>

Very true, not the end of the world. That is why your design is something
I can "live with". But it does have the potential to cause pain downstream in
the future, and I just don't see any reason why this pain cannot be avoided.
I fail to see the downside of not allowing old xfs_growfs to modify thin space.

>> The result will be that all the thin filesystems will "auto-grow"
>> to the thick size of the volume.
>
> Of course it will - the user/app/admin asked the kernel to grow the
> filesystem to the size of the underlying device.  I don't know what
> you expect a thinspace filesystem to do here other than *grow the
> filesystem to the size of the underlying device*.

I expect the kernel to return EINVAL and warn that the user needs to use
newer xfsprogs to auto-grow thin space.

Let me re-iterate the requirement we are disagreeing on:
- old xfs_growfs will succeed in growing, *except* for a thin fs that
  was previously shrunk (i.e. dblocks != usable_dblocks)
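
In rough kernel-code terms, the guard I'm asking for is of this shape
(illustrative only; "thin_aware_caller" stands for whatever V1/flag
mechanism we settle on):

	/* reject a thin-unaware grow of a previously shrunk thin fs */
	if (xfs_sb_version_hasthinspace(&mp->m_sb) &&
	    mp->m_sb.sb_usable_dblocks != mp->m_sb.sb_dblocks &&
	    !thin_aware_caller)
		return -EINVAL;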

You explained at length why the exception is not a must.
I do not remember a single argument that explains what's
wrong with keeping the exception.
I claimed that this exception can reduce pain to end users.
In response, you wrote that "user/app/admin asked to grow fs to
maximum size" and in so many words that they can "keep the pieces".

What bad things can happen if the clueless user/app/admin is refused
permission to grow the fs to maximum size?
The practice of "not sure you know what you are doing so please keep
away" has been a very good practice for xfs and file systems for years.
Why not abide by this law in this case?

Cheers,
Amir.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems
  2017-11-06 21:20                         ` Dave Chinner
@ 2017-11-07 11:28                           ` Brian Foster
  0 siblings, 0 replies; 47+ messages in thread
From: Brian Foster @ 2017-11-07 11:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Tue, Nov 07, 2017 at 08:20:46AM +1100, Dave Chinner wrote:
> On Mon, Nov 06, 2017 at 08:01:00AM -0500, Brian Foster wrote:
> > On Mon, Nov 06, 2017 at 09:50:28AM +1100, Dave Chinner wrote:
> > > On Fri, Nov 03, 2017 at 07:36:23AM -0400, Brian Foster wrote:
> > > > On Thu, Nov 02, 2017 at 07:47:40PM -0700, Darrick J. Wong wrote:
> > > > > FWIW the way I've been modelling this patch series in my head is that we
> > > > > format an arbitrarily large filesystem (m_LBA_size) address space on a
> > > > > thinp, feed statfs an "adjusted" size (m_usable_size) which restricts
> > > > > how much space we can allocate, and now growfs increases or decreases
> > > > > the adjusted size without having to relocate anything or mess with the
> > > > > address space.  If the adjusted size ever exceeds the address space
> > > > > size, then we tack on more AGs like we've always done.  From that POV,
> > > > > there's no need to physically shrink (i.e. relocate) anything (and we
> > > > > can leave that for later/never).
> > > 
> > > [...]
> > > 
> > > > For example, suppose we had an absolutely crude, barebones implementation
> > > > of physical shrink right now that basically trimmed the amount of space
> > > > from the end of the fs iff those AGs were completely empty and otherwise
> > > > returned -EBUSY. There is no other userspace support, etc. As such, this
> > > > hypothetical feature is extremely limited to being usable immediately
> > > > after a growfs and thus probably has no use case other than "undo my
> > > > accidental growfs."
> > > > 
> > > > If we had that right now, _then_ what would the logical shrink interface
> > > > look like?
> > > 
> > > Absolutely no different to what I'm proposing we do right now. That
> > > is, the behaviour of the "shrink to size X" ioctl is determined by
> > > the feature bit in the superblock.  Hence if the thinspace feature
> > > is set we do a thin shrink, and if it is not set we do a physical
> > > shrink. i.e. grow/shrink behaviour is defined by the kernel
> > > implementation, not the user or the interface.
> > > 
> > 
> > I don't buy that argument at all. ;) What you describe above may be
> > reasonable for the current situation where shrink doesn't actually exist
> > (or thin comes first),
> 
> Which is the case we are discussing here. thinspace shrink is here,
> now, physical shrink is no closer than it was 10 years ago. So it's
> reasonable to design changes around the needs of thinspace shrink
> because physical shrink is still years away (if ever).
> 
> > but the above example assumes that there is at
> > least one simple and working physical shrink use case wired up to the
> > existing interface already.
> 
> IOWs, this is a strawman argument that involves designing an API to
> suit the strawman.
> 
> [....]
> 
> > In summary, my arguments here consist mostly of a collection of red
> > flags that I see rather than hard incompatibilities or specific use
> > cases I want to support. The problematic situations change depending on
> > whether we decide to support physical shrink on thin fs or not and so
> > it's not really possible or important to try and pin them all down.
> > OTOH, it's also quite possible that none of them ever materialize at
> > all.
> 
> And that's the point I keep making: we don't know which of the
> strawmen being presented are going to matter (if at all) until we
> have physical shrink designed and are deep into the implementation.
> 
> IOWs, trying to work out the future API needs of a physical shrink
> is just a guessing game right now.
> 
> > If they do, I'm pretty sure we could find ways to address each one
> > individually as we progress, or document potentially confusing behavior
> > appropriately, etc. The larger point is that I think much of this simply
> > goes away with a cleaner interface. IMO, this boils down to what I think
> > is just a matter of practicing good software engineering and system/user
> > interface design.
> 
> Yes, but designing based on a /guess/ is *bad engineering practice*.
> It almost always ends up wrong and has to be reworked, and that
> means we get stuck supporting an API we don't need or want forever
> more.
> 

To suggest we're attempting to design a future physical shrink API is a
mischaracterization of the discussion. The suggestion here is quite
simple:

1.) preserve the behavior of the existing API
2.) design a suitable/flexible interface for the thin feature

ISTM that both can be accomplished by adding a single field to
xfs_growfs_data. This reduces risk of interface conflict, adds
flexibility to the controls for thin and facilitates preservation of
expected behavior for existing growfs users (i.e., Amir's use case).
AFAICT, this does not introduce any sort of backwards compatibility
issues for thin and doesn't really require much more effort.

The counter arguments that have been expressed are that the risk may not
materialize into real problems and that the flexibility may not be
necessary. This is certainly true, but IMO the reduced risk and added
flexibility are worth the trivial amount of extra effort. Clearly you do
not agree.

Brian

> Yes, we've categorised the risk that we might need an interface
> change in future - as we should - but we don't know which of those
> risks are going to materialise.  IOWs, we can't solve this interface
> problem with the information or insight we currently have - we need
> to implement physical shrink and determine which of these risks
> actually materialise, and then we can address the interface issue
> knowing that we're solving the problems that physical shrink
> introduces.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread

Thread overview: 47+ messages
2017-10-26  8:33 [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Dave Chinner
2017-10-26  8:33 ` [PATCH 01/14] xfs: factor out AG header initialisation from growfs core Dave Chinner
2017-10-26  8:33 ` [PATCH 02/14] xfs: convert growfs AG header init to use buffer lists Dave Chinner
2017-10-26  8:33 ` [PATCH 03/14] xfs: factor ag btree reoot block initialisation Dave Chinner
2017-10-26  8:33 ` [PATCH 04/14] xfs: turn ag header initialisation into a table driven operation Dave Chinner
2017-10-26  8:33 ` [PATCH 05/14] xfs: make imaxpct changes in growfs separate Dave Chinner
2017-10-26  8:33 ` [PATCH 06/14] xfs: separate secondary sb update in growfs Dave Chinner
2017-10-26  8:33 ` [PATCH 07/14] xfs: rework secondary superblock updates " Dave Chinner
2017-10-26  8:33 ` [PATCH 08/14] xfs: move various type verifiers to common file Dave Chinner
2017-10-26  8:33 ` [PATCH 09/14] xfs: split usable space from block device size Dave Chinner
2017-10-26  8:33 ` [PATCH 10/14] xfs: hide reserved metadata space from users Dave Chinner
2017-10-26  8:33 ` [PATCH 11/14] xfs: bump XFS_IOC_FSGEOMETRY to v5 structures Dave Chinner
2017-10-26  8:33 ` [PATCH 12/14] xfs: convert remaingin xfs_sb_version_... checks to bool Dave Chinner
2017-10-26 16:03   ` Darrick J. Wong
2017-10-26  8:33 ` [PATCH 13/14] xfs: add suport for "thin space" filesystems Dave Chinner
2017-10-26  8:33 ` [PATCH 14/14] xfs: add growfs support for changing usable blocks Dave Chinner
2017-10-26 11:30   ` Amir Goldstein
2017-10-26 12:48     ` Dave Chinner
2017-10-26 13:32       ` Amir Goldstein
2017-10-27 10:26         ` Amir Goldstein
2017-10-26 11:09 ` [RFC PATCH 0/14] xfs: Towards thin provisioning aware filesystems Amir Goldstein
2017-10-26 12:35   ` Dave Chinner
2017-11-01 22:31     ` Darrick J. Wong
2017-10-30 13:31 ` Brian Foster
2017-10-30 21:09   ` Dave Chinner
2017-10-31  4:49     ` Amir Goldstein
2017-10-31 22:40       ` Dave Chinner
2017-10-31 11:24     ` Brian Foster
2017-11-01  0:45       ` Dave Chinner
2017-11-01 14:17         ` Brian Foster
2017-11-01 23:53           ` Dave Chinner
2017-11-02 11:25             ` Brian Foster
2017-11-02 23:30               ` Dave Chinner
2017-11-03  2:47                 ` Darrick J. Wong
2017-11-03 11:36                   ` Brian Foster
2017-11-05 22:50                     ` Dave Chinner
2017-11-06 13:01                       ` Brian Foster
2017-11-06 21:20                         ` Dave Chinner
2017-11-07 11:28                           ` Brian Foster
2017-11-03 11:26                 ` Brian Foster
2017-11-03 12:19                   ` Amir Goldstein
2017-11-06  1:16                     ` Dave Chinner
2017-11-06  9:48                       ` Amir Goldstein
2017-11-06 21:46                         ` Dave Chinner
2017-11-07  5:30                           ` Amir Goldstein
2017-11-05 23:51                   ` Dave Chinner
2017-11-06 13:07                     ` Brian Foster
