All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 04/11] xfs: CoW fork operations should only update quota reservations
Date: Wed, 24 Jan 2018 09:22:16 -0500	[thread overview]
Message-ID: <20180124142215.GD37554@bfoster.bfoster> (raw)
In-Reply-To: <151676030319.12349.11227408793644481553.stgit@magnolia>

On Tue, Jan 23, 2018 at 06:18:23PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Since the CoW fork only exists in memory, it is incorrect to update the
> on-disk quota block counts when we modify the CoW fork.  Unlike the data
> fork, even real extents in the CoW fork are only reservations (on-disk
> they're owned by the refcountbt) so they must not be tracked in the on
> disk quota info.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c |  203 ++++++++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_reflink.c     |    8 +-
>  2 files changed, 196 insertions(+), 15 deletions(-)
> 
> 

Mostly comments on the comment so far... I still have to grok the code,
but working through to this point made my brain hurt a bit so I'm
sending what I have so far. ;P

> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 6e6f3cb..e3e8f7c 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -52,6 +52,145 @@
>  #include "xfs_refcount.h"
>  #include "xfs_icache.h"
>  
> +/*
> + * Data/Attr Fork Mapping Lifecycle
> + *
> + * The data fork contains the block mappings between logical blocks in a file
> + * and physical blocks on the disk.  The XFS notions of delayed allocation
> + * reservations, unwritten extents, and real extents follow well known
> + * conventions in the filesystem world.
> + *
> + * As a side note, the attribute fork does the same for extended attribute
> + * blocks, though the logical block offsets are not available to userspace and
> + * the only valid states are HOLE and REAL.
> + *
> + * Metadata involved outside of the block mapping itself are as follows:
> + *
> + * - i_delayed_blks: Number of blocks that are reserved for delayed allocation.
> + * - i_cow_blocks: Number of blocks reserved for copy on write staging.
> + *

I know it's implied by some of the field names, but I think it would be
useful to point out what data structures these are associated with.
Also, i_cow_blocks technically doesn't exist yet.. right?

> + * - di_nblocks: Number of blocks (on-disk) assigned to the inode.
> + *

Hmm, some of these are self-explanatory or already documented where they
are defined. I'm wondering if we really need to repeat descriptions for
all of these (as opposed to ensuring they are all sufficiently described
where they are defined). That also saves us from having to keep field
names and whatnot synced up in an unexpected source location.

> + * - d_bcount: Number of quota blocks accounted for by on-disk metadata.
> + * - q_res_bcount: Number of quota blocks reserved in-core for future writes +
> + *           blocks mentioned by on-disk metadata.
> + *
> + * - qt_blk_res: Number of quota blocks reserved in-core for this transaction.
> + *           Unused reservation is given back to q_res_bcount on commit.
> + * - qt_bcount: Number of quota blocks used by this transaction from
> + *           qt_blk_res.  d_bcount is increased by this on commit.
> + * - qt_delbcount: Number of quota blocks used by this transaction from
> + *           q_res_bcount but not q_res_bcount.  d_bcount is increased by this
> + *           on commit.
> + *

"from q_res_bcount but not q_res_bcount" ?

Also: qt_bcount_delta, qt_delbcount_delta?

> + * - sb_fdblocks: Number of free blocks recorded in the superblock on disk.
> + * - fdblocks: Number of free blocks recorded in the superblock minus any
> + *           in-core reservations made in anticipation of future writes.
> + *
> + * - t_blk_res: Number of blocks reserved out of fdblocks for a transaction.
> + *           When the transaction commits, t_blk_res - t_blk_res_used is given
> + *           back to fdblocks.
> + * - t_blk_res_used: Number of blocks used by this transaction that were
> + *           reserved for this transaction.
> + * - t_fdblocks_del: Number of blocks by which fdblocks and sb_fdblocks will
> + *           have to decrease at commit.
> + * - t_res_fdblocks_delta: Number of blocks by which sb_fdblocks will have to
> + *           decrease at commit.  We assume that fdblocks was decreased
> + *           prior to the transaction.
> + *
> + * Data fork block mappings have four logical states:
> + *
> + *    +--------> UNWRITTEN <------+
> + *    |              ^            |
> + *    |              v            v
> + * DELALLOC <----> HOLE <------> REAL
> + *    |                           ^
> + *    |                           |
> + *    +---------------------------+
> + *

I'm not sure we need a graphic for the extent states. Non-hole
conversions to delalloc is the only transition that doesn't make any
sense.

> + * The state transitions and required metadata updates are as follows:
> + *
> + * - HOLE to DELALLOC: Increase i_delayed_blks and q_res_bcount, and decrease
> + *           fdblocks.
> + * - HOLE to REAL: Increase di_nblocks and qt_bcount, and decrease fdblocks.
> + * - HOLE to UNWRITTEN: Same as above.
> + *
> + * - DELALLOC to UNWRITTEN: Increase di_nblocks and qt_delbcount, and decrease
> + *           i_delayed_blks.
> + * - DELALLOC to REAL: Same as above.
> + * - DELALLOC to HOLE: Increase fdblocks, and decrease i_delayed_blks and
> + *           q_res_bcount.
> + *
> + * - UNWRITTEN to HOLE: Decrease di_nblocks and q_bcount, and increase fdblocks.
> + * - UNWRITTEN to REAL: No change.
> + *
> + * - REAL to UNWRITTEN: No change.
> + * - REAL to HOLE: Decrease di_nblocks and q_bcount, and increase fdblocks.
> + *
> + * Note in particular that delalloc reservations have "transaction-less"
> + * quota reservations via q_res_bcount.  If the reservation is allocated,
> + * qt_delbcount is used to increment d_bcount without touching q_res_bcount.
> + * Filling a hole with an allocated extent, by contrast, uses qt_blk_res
> + * to make a reservation in q_res_bcount, qt_bcount to record the number
> + * of allocated blocks; at commit qt_bcount is added to d_bcount and
> + * qt_blk_res - qt_bcount is added back to q_res_bcount.
> + *

While I agree with the intent/usefulness of a big comment around how the
quota accounting works, this all kind of reads like a braindump so far.
E.g., this state changes these fields, this field represents how much to
change some other field, etc. While I'm sure that is useful information
to work out the problem being resolved here, it doesn't explain much how
everything works. IOW, I still feel like I need to go trace through the
code to understand the comment, as opposed to the comment helping me
trace through the code.

I think it would be better to have a high level description about how
the quota accounting works with respect to the transaction/extent
states, what high-level data structures are involved and how they
relate, etc., rather than what seems essentially to be a chart that maps
states to fields. 

I'm also wondering how much of the whole picture really needs to be
described here to cover quota accounting with respect to block mapping
(sufficiently to differentiate data fork from cow fork, which I take is
the purpose of this section). For example, do we really need the
internal transaction details for how quota deltas are carried?

Instead, I think it might be sufficient to explain that the quota system
works in two "levels" (for lack of a better term :/), one for
reservation and another for real block usage. The reservation is
associated with transaction reservation and/or delayed block allocation
(no tx). In either case, quota reservation is converted to real quota
block usage when a transaction commits that maps real/physical blocks.
If the transaction held extra reservation that went unused, that quota
reservation is released. The primary difference is that transactions to
convert delalloc -> real do not reserve quota blocks in the first place,
since that has already occurred, so they just need to make sure to
convert/persist the quota res for the blocks that were converted to
real.

> + * Copy on Write Fork Mapping Lifecycle
> + *
> + * The CoW fork handles things differently from the data fork because its
> + * mappings only exist in memory-- the refcount btree is the on-disk owner of
> + * the extents until they're remapped into the data fork.  Therefore,
> + * unwritten and real extents in the CoW fork are treated the same way as
> + * delayed allocation extents.  Quota and fdblock changes only exist in
> + * memory, which requires some twists in the bmap functions.
> + *

Ok, but perhaps this should point out what happens when cow blocks are
reserved with respect to quotas..? IIUC, a delalloc quota reservation
occurs just the same as above, the blocks simply reside in another fork.

> + * The CoW fork extent state diagram looks like this:
> + *
> + *    +--------> UNWRITTEN -------+
> + *    |              ^            |
> + *    |              v            v
> + * DELALLOC <----> HOLE <------- REAL
> + *
> + * Holes are still holes.  Delayed allocation extents reserve blocks for
> + * landing future writes, just like they do in the data fork.  However, unlike
> + * the data fork, unwritten extents signal an extent that has been allocated
> + * but is not currently undergoing writeback.  Real extents are undergoing
> + * writeback, and when that writeback finishes the corresponding data fork
> + * extent will be punched out and the CoW fork counterpart moved to the new
> + * hole in the data fork.
> + *

Ok, so the difference is that for the COW fork, the _extent state_
conversion is not the appropriate trigger event to convert quota
reservation to real quota usage. Instead, the blocks being remapped from
the COW fork to the data fork is when that should occur.

> + * The state transitions and required metadata updates are as follows:
> + *
> + * - HOLE to DELALLOC: Increase i_cow_blocks and q_res_bcount, and decrease
> + *           fdblocks.
> + * - HOLE to UNWRITTEN: Same as above, but since we reserved quota via
> + *           qt_blk_res (which increased q_res_bcount) when we allocate the
> + *           extent we have to decrease qt_blk_res so that the commit doesn't
> + *           give the allocated CoW blocks back.
> + *

Hmm, this is a little confusing. Looking at the code change and comment
below, I think I get what this is trying to do, which is essentially
make a real block cow fork alloc behave like a delalloc reservation
(with respect to quota). FWIW, I think what confuses me is the assertion
that the blocks would be "given back" otherwise. The only reference I
have to compare is data fork alloc behavior, which implies that used
reservation would not be given back, but rather converted to real quota
usage on block allocation (and excess reservation would still be given
back, which afaict we still want to happen). So the trickery is required
to prevent conversion of quota reservation for the allocated cow blocks,
let that res sit around until the cow blocks are remapped, and release
unused reservation from the tx as normal. Am I following that correctly?

> + * - DELALLOC to UNWRITTEN: No change.
> + * - DELALLOC to HOLE: Decrease i_cow_blocks and q_res_bcount, and increase
> + *           fdblocks.
> + *
> + * - UNWRITTEN to HOLE: Same as DELALLOC to HOLE.
> + * - UNWRITTEN to REAL: No change.
> + *
> + * - REAL to HOLE: This transition happens when we've finished a write
> + *           operation and need to move the mapping to the data fork.  We
> + *           punch the correspond data fork mappings, which decreases
> + *           qt_bcount.  Then we map the CoW fork mapping into the hole we
> + *           just cleared out of the data fork, which increases qt_bcount.
> + *           There's a subtlety here -- if we promoted a write over a hole to
> + *           CoW, there will be a net increase in qt_bcount, which is fine
> + *           because we already reserved the quota when we filled the CoW
> + *           fork.  Finally, we punch the CoW fork mapping, which decreases
> + *           q_res_bcount.
> + *
> + * Notice how all CoW fork extents use transactionless quota reservations and
> + * the in-core fdblocks to maintain state, and we avoid updating any on-disk
> + * metadata.  This is essential to maintain metadata correctness if the system
> + * goes down.
> + */
>  
>  kmem_zone_t		*xfs_bmap_free_item_zone;
>  
> @@ -3337,6 +3476,39 @@ xfs_bmap_btalloc_filestreams(
>  	return 0;
>  }
>  
> +/* Deal with CoW fork accounting when we allocate a block. */
> +static void
> +xfs_bmap_btalloc_cow(
> +	struct xfs_bmalloca	*ap,
> +	struct xfs_alloc_arg	*args)
> +{
> +	/* Filling a previously reserved extent; nothing to do here. */
> +	if (ap->wasdel)
> +		return;
> +
> +	/*
> +	 * The CoW fork only exists in memory, so the on-disk quota accounting
> +	 * must not incude any CoW fork extents.  Therefore, CoW blocks are
> +	 * only tracked in the in-core dquot block count (q_res_bcount).
> +	 *
> +	 * If we get here, we're filling a CoW hole with a real (non-delalloc)
> +	 * CoW extent having reserved enough blocks from both q_res_bcount and
> +	 * qt_blk_res to guarantee that we won't run out of space.  The unused
> +	 * qt_blk_res is given back to q_res_bcount when the transaction
> +	 * commits.
> +	 *
> +	 * We don't want the quota accounting for our newly allocated blocks
> +	 * to be given back, so we must decrease qt_blk_res without decreasing
> +	 * q_res_bcount.
> +	 *
> +	 * Note: If we're allocating a delalloc extent, we already reserved
> +	 * the q_res_bcount blocks, so no quota accounting update is needed
> +	 * here.
> +	 */
> +	xfs_trans_mod_dquot_byino(ap->tp, ap->ip, XFS_TRANS_DQ_RES_BLKS,
> +			-(long)args->len);
> +}

Factoring nit.. if we're going to refactor bits of xfs_bmap_btalloc()
out, it might be cleaner to factor out all of the quota logic rather
than just the cow bits (which is basically just a simple check and
function call). E.g., refactor into an xfs_bmap_btalloc_quota() helper
that does the right thing based on the fork, with comments as to why,
etc. (and perhaps just leave the unrelated di_nblocks change behind).

Brian

> +
>  STATIC int
>  xfs_bmap_btalloc(
>  	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
> @@ -3571,19 +3743,22 @@ xfs_bmap_btalloc(
>  			*ap->firstblock = args.fsbno;
>  		ASSERT(nullfb || fb_agno <= args.agno);
>  		ap->length = args.len;
> -		if (!(ap->flags & XFS_BMAPI_COWFORK))
> -			ap->ip->i_d.di_nblocks += args.len;
> -		xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
>  		if (ap->wasdel)
>  			ap->ip->i_delayed_blks -= args.len;
> -		/*
> -		 * Adjust the disk quota also. This was reserved
> -		 * earlier.
> -		 */
> -		xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> -			ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> -					XFS_TRANS_DQ_BCOUNT,
> -			(long) args.len);
> +		if (ap->flags & XFS_BMAPI_COWFORK) {
> +			xfs_bmap_btalloc_cow(ap, &args);
> +		} else {
> +			ap->ip->i_d.di_nblocks += args.len;
> +			xfs_trans_log_inode(ap->tp, ap->ip, XFS_ILOG_CORE);
> +			/*
> +			 * Adjust the disk quota also. This was reserved
> +			 * earlier.
> +			 */
> +			xfs_trans_mod_dquot_byino(ap->tp, ap->ip,
> +				ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
> +						XFS_TRANS_DQ_BCOUNT,
> +				(long) args.len);
> +		}
>  	} else {
>  		ap->blkno = NULLFSBLOCK;
>  		ap->length = 0;
> @@ -4776,6 +4951,7 @@ xfs_bmap_del_extent_cow(
>  	struct xfs_bmbt_irec	new;
>  	xfs_fileoff_t		del_endoff, got_endoff;
>  	int			state = BMAP_COWFORK;
> +	int			error;
>  
>  	XFS_STATS_INC(mp, xs_del_exlist);
>  
> @@ -4832,6 +5008,11 @@ xfs_bmap_del_extent_cow(
>  		xfs_iext_insert(ip, icur, &new, state);
>  		break;
>  	}
> +
> +	/* Remove the quota reservation */
> +	error = xfs_trans_reserve_quota_nblks(NULL, ip,
> +			-(long)del->br_blockcount, 0, XFS_QMOPT_RES_REGBLKS);
> +	ASSERT(error == 0);
>  }
>  
>  /*
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 82abff6..e367351 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -599,10 +599,6 @@ xfs_reflink_cancel_cow_blocks(
>  					del.br_startblock, del.br_blockcount,
>  					NULL);
>  
> -			/* Update quota accounting */
> -			xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
> -					-(long)del.br_blockcount);
> -
>  			/* Roll the transaction */
>  			xfs_defer_ijoin(&dfops, ip);
>  			error = xfs_defer_finish(tpp, &dfops);
> @@ -795,6 +791,10 @@ xfs_reflink_end_cow(
>  		if (error)
>  			goto out_defer;
>  
> +		/* Charge this new data fork mapping to the on-disk quota. */
> +		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
> +				(long)del.br_blockcount);
> +
>  		/* Remove the mapping from the CoW fork. */
>  		xfs_bmap_del_extent_cow(ip, &icur, &got, &del);
>  
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2018-01-24 14:22 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-01-24  2:17 [PATCH 00/11] xfs: reflink/scrub/quota fixes Darrick J. Wong
2018-01-24  2:18 ` [PATCH 01/11] xfs: reflink should break pnfs leases before sharing blocks Darrick J. Wong
2018-01-24 14:16   ` Brian Foster
2018-01-26  9:06   ` Christoph Hellwig
2018-01-26 18:26     ` Darrick J. Wong
2018-01-24  2:18 ` [PATCH 02/11] xfs: only grab shared inode locks for source file during reflink Darrick J. Wong
2018-01-24 14:18   ` Brian Foster
2018-01-24 18:40     ` Darrick J. Wong
2018-01-26 12:07   ` Christoph Hellwig
2018-01-26 18:48     ` Darrick J. Wong
2018-01-27  3:32     ` Dave Chinner
2018-01-24  2:18 ` [PATCH 03/11] xfs: call xfs_qm_dqattach before performing reflink operations Darrick J. Wong
2018-01-24 14:18   ` Brian Foster
2018-01-26  9:07   ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 04/11] xfs: CoW fork operations should only update quota reservations Darrick J. Wong
2018-01-24 14:22   ` Brian Foster [this message]
2018-01-24 19:14     ` Darrick J. Wong
2018-01-25 13:01       ` Brian Foster
2018-01-25 17:52         ` Darrick J. Wong
2018-01-25  1:20   ` [PATCH v2 " Darrick J. Wong
2018-01-25 13:03     ` Brian Foster
2018-01-25 18:20       ` Darrick J. Wong
2018-01-26 13:02         ` Brian Foster
2018-01-26 18:40           ` Darrick J. Wong
2018-01-26 12:12     ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 05/11] xfs: track CoW blocks separately in the inode Darrick J. Wong
2018-01-25 13:06   ` Brian Foster
2018-01-25 19:21     ` Darrick J. Wong
2018-01-26 13:04       ` Brian Foster
2018-01-26 19:08         ` Darrick J. Wong
2018-01-26 12:15   ` Christoph Hellwig
2018-01-26 19:00     ` Darrick J. Wong
2018-01-26 23:51       ` Darrick J. Wong
2018-01-24  2:18 ` [PATCH 06/11] xfs: fix up cowextsz allocation shortfalls Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-25 20:20     ` Darrick J. Wong
2018-01-26 13:06       ` Brian Foster
2018-01-26 19:12         ` Darrick J. Wong
2018-01-26  9:11   ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 07/11] xfs: always zero di_flags2 when we free the inode Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-25 18:36     ` Darrick J. Wong
2018-01-26  9:08   ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 08/11] xfs: fix tracepoint %p formats Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-25 18:47     ` Darrick J. Wong
2018-01-26  0:19       ` Darrick J. Wong
2018-01-26  9:09         ` Christoph Hellwig
2018-01-24  2:18 ` [PATCH 09/11] xfs: make tracepoint inode number format consistent Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-26  9:09   ` Christoph Hellwig
2018-01-24  2:19 ` [PATCH 10/11] xfs: refactor inode verifier corruption error printing Darrick J. Wong
2018-01-25 17:31   ` Brian Foster
2018-01-25 18:23     ` Darrick J. Wong
2018-01-26  9:10   ` Christoph Hellwig
2018-01-24  2:19 ` [PATCH 11/11] xfs: don't clobber inobt/finobt cursors when xref with rmap Darrick J. Wong
2018-01-26  9:10   ` Christoph Hellwig
2018-01-25  5:26 ` [PATCH 12/11] xfs: refactor quota code in xfs_bmap_btalloc Darrick J. Wong
2018-01-26 12:17   ` Christoph Hellwig
2018-01-26 21:46     ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180124142215.GD37554@bfoster.bfoster \
    --to=bfoster@redhat.com \
    --cc=darrick.wong@oracle.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.