Linux-XFS Archive on lore.kernel.org
 help / color / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 02/24] xfs: add an inode item lock
Date: Fri, 22 May 2020 14:24:36 -0700
Message-ID: <20200522212436.GF8230@magnolia> (raw)
In-Reply-To: <20200522035029.3022405-3-david@fromorbit.com>

On Fri, May 22, 2020 at 01:50:07PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The inode log item is kind of special in that it can be aggregating
> new changes in memory at the same time time existing changes are
> being written back to disk. This means there are fields in the log
> item that are accessed concurrently from contexts that don't share
> any locking at all.
> 
> e.g. updating ili_last_fields occurs at flush time under the
> ILOCK_EXCL and flush lock at flush time, under the flush lock at IO
> completion time, and is read under the ILOCK_EXCL when the inode is
> logged.  Hence there is no actual serialisation between reading the
> field during logging of the inode in transactions vs clearing the
> field in IO completion.
> 
> We currently get away with this by the fact that we are only
> clearing fields in IO completion, and nothing bad happens if we
> accidentally log more of the inode than we actually modify. Worst
> case is we consume a tiny bit more memory and log bandwidth.
> 
> However, if we want to do more complex state manipulations on the
> log item that requires updates at all three of these potential
> locations, we need to have some mechanism of serialising those
> operations. To do this, introduce a spinlock into the log item to
> serialise internal state.
> 
> This could be done via the xfs_inode i_flags_lock, but this then
> leads to potential lock inversion issues where inode flag updates
> need to occur inside locks that best nest inside the inode log item
> locks (e.g. marking inodes stale during inode cluster freeing).
> Using a separate spinlock avoids these sorts of problems and
> simplifies future code.
> 
> This does not touch the use of ili_fields in the item formatting
> code - that is entirely protected by the ILOCK_EXCL at this point in
> time, so it remains untouched.

...at first lgance I thought it was odd that this patch then takes
ili_lock before clearing ili_fields, but then I realized that yes indeed
we also hold the ILOCK any time we do that.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 44 ++++++++++++++++++---------------
>  fs/xfs/xfs_file.c               |  9 ++++---
>  fs/xfs/xfs_inode.c              | 20 +++++++++------
>  fs/xfs/xfs_inode_item.c         |  7 ++++++
>  fs/xfs/xfs_inode_item.h         |  3 ++-
>  5 files changed, 51 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index b5dfb66548422..510b996008221 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -81,15 +81,19 @@ xfs_trans_ichgtime(
>   */
>  void
>  xfs_trans_log_inode(
> -	xfs_trans_t	*tp,
> -	xfs_inode_t	*ip,
> -	uint		flags)
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip,
> +	uint			flags)
>  {
> -	struct inode	*inode = VFS_I(ip);
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
> +	struct inode		*inode = VFS_I(ip);
> +	uint			iversion_flags = 0;
>  
>  	ASSERT(ip->i_itemp != NULL);
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  
> +	tp->t_flags |= XFS_TRANS_DIRTY;
> +
>  	/*
>  	 * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
>  	 * don't matter - we either will need an extra transaction in 24 hours
> @@ -102,15 +106,6 @@ xfs_trans_log_inode(
>  		spin_unlock(&inode->i_lock);
>  	}
>  
> -	/*
> -	 * Record the specific change for fdatasync optimisation. This
> -	 * allows fdatasync to skip log forces for inodes that are only
> -	 * timestamp dirty. We do this before the change count so that
> -	 * the core being logged in this case does not impact on fdatasync
> -	 * behaviour.
> -	 */
> -	ip->i_itemp->ili_fsync_fields |= flags;
> -
>  	/*
>  	 * First time we log the inode in a transaction, bump the inode change
>  	 * counter if it is configured for this to occur. While we have the
> @@ -120,13 +115,21 @@ xfs_trans_log_inode(
>  	 * set however, then go ahead and bump the i_version counter
>  	 * unconditionally.
>  	 */
> -	if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
> -	    IS_I_VERSION(VFS_I(ip))) {
> -		if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> -			flags |= XFS_ILOG_CORE;
> +	if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags) &&
> +	    IS_I_VERSION(inode)) {
> +		if (inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> +			iversion_flags = XFS_ILOG_CORE;
>  	}
>  
> -	tp->t_flags |= XFS_TRANS_DIRTY;
> +	/*
> +	 * Record the specific change for fdatasync optimisation. This
> +	 * allows fdatasync to skip log forces for inodes that are only
> +	 * timestamp dirty. We do this before the change count so that
> +	 * the core being logged in this case does not impact on fdatasync
> +	 * behaviour.
> +	 */
> +	spin_lock(&iip->ili_lock);
> +	iip->ili_fsync_fields |= flags;
>  
>  	/*
>  	 * Always OR in the bits from the ili_last_fields field.
> @@ -135,8 +138,9 @@ xfs_trans_log_inode(
>  	 * See the big comment in xfs_iflush() for an explanation of
>  	 * this coordination mechanism.
>  	 */
> -	flags |= ip->i_itemp->ili_last_fields;
> -	ip->i_itemp->ili_fields |= flags;
> +	flags |= iip->ili_last_fields | iversion_flags;
> +	iip->ili_fields |= flags;
> +	spin_unlock(&iip->ili_lock);
>  }
>  
>  int
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 403c90309a8ff..0abf770b77498 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -94,6 +94,7 @@ xfs_file_fsync(
>  {
>  	struct inode		*inode = file->f_mapping->host;
>  	struct xfs_inode	*ip = XFS_I(inode);
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
>  	struct xfs_mount	*mp = ip->i_mount;
>  	int			error = 0;
>  	int			log_flushed = 0;
> @@ -137,13 +138,15 @@ xfs_file_fsync(
>  	xfs_ilock(ip, XFS_ILOCK_SHARED);
>  	if (xfs_ipincount(ip)) {
>  		if (!datasync ||
> -		    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> -			lsn = ip->i_itemp->ili_last_lsn;
> +		    (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> +			lsn = iip->ili_last_lsn;
>  	}
>  
>  	if (lsn) {
>  		error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> -		ip->i_itemp->ili_fsync_fields = 0;
> +		spin_lock(&iip->ili_lock);
> +		iip->ili_fsync_fields = 0;
> +		spin_unlock(&iip->ili_lock);
>  	}
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ca9f2688b745d..57781c0dbbec5 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2683,9 +2683,11 @@ xfs_ifree_cluster(
>  				continue;
>  
>  			iip = ip->i_itemp;
> +			spin_lock(&iip->ili_lock);
>  			iip->ili_last_fields = iip->ili_fields;
>  			iip->ili_fields = 0;
>  			iip->ili_fsync_fields = 0;
> +			spin_unlock(&iip->ili_lock);
>  			xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  						&iip->ili_item.li_lsn);
>  
> @@ -2721,6 +2723,7 @@ xfs_ifree(
>  {
>  	int			error;
>  	struct xfs_icluster	xic = { 0 };
> +	struct xfs_inode_log_item *iip = ip->i_itemp;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>  	ASSERT(VFS_I(ip)->i_nlink == 0);
> @@ -2758,7 +2761,9 @@ xfs_ifree(
>  	ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS;
>  
>  	/* Don't attempt to replay owner changes for a deleted inode */
> -	ip->i_itemp->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
> +	spin_lock(&iip->ili_lock);
> +	iip->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);

Need spaces around the '|'...

> +	spin_unlock(&iip->ili_lock);
>  
>  	/*
>  	 * Bump the generation count so no one will be confused
> @@ -3814,20 +3819,19 @@ xfs_iflush_int(
>  	 * know that the information those bits represent is permanently on
>  	 * disk.  As long as the flush completes before the inode is logged
>  	 * again, then both ili_fields and ili_last_fields will be cleared.
> -	 *
> -	 * We can play with the ili_fields bits here, because the inode lock
> -	 * must be held exclusively in order to set bits there and the flush
> -	 * lock protects the ili_last_fields bits.  Store the current LSN of the
> -	 * inode so that we can tell whether the item has moved in the AIL from
> -	 * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
> -	 * because it is a 64 bit value that cannot be read atomically.
>  	 */
>  	error = 0;
>  flush_out:
> +	spin_lock(&iip->ili_lock);
>  	iip->ili_last_fields = iip->ili_fields;
>  	iip->ili_fields = 0;
>  	iip->ili_fsync_fields = 0;
> +	spin_unlock(&iip->ili_lock);
>  
> +	/*
> +	 * Store the current LSN of the inode so that we can tell whether the
> +	 * item has moved in the AIL from xfs_iflush_done().
> +	 */
>  	xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>  				&iip->ili_item.li_lsn);
>  
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index b17384aa8df40..6ef9cbcfc94a7 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -637,6 +637,7 @@ xfs_inode_item_init(
>  	iip = ip->i_itemp = kmem_zone_zalloc(xfs_ili_zone, 0);
>  
>  	iip->ili_inode = ip;
> +	spin_lock_init(&iip->ili_lock);
>  	xfs_log_item_init(mp, &iip->ili_item, XFS_LI_INODE,
>  						&xfs_inode_item_ops);
>  }
> @@ -738,7 +739,11 @@ xfs_iflush_done(
>  	list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
>  		list_del_init(&blip->li_bio_list);
>  		iip = INODE_ITEM(blip);
> +
> +		spin_lock(&iip->ili_lock);
>  		iip->ili_last_fields = 0;
> +		spin_unlock(&iip->ili_lock);
> +
>  		xfs_ifunlock(iip->ili_inode);
>  	}
>  	list_del(&tmp);
> @@ -762,9 +767,11 @@ xfs_iflush_abort(
>  		 * Clear the inode logging fields so no more flushes are
>  		 * attempted.
>  		 */
> +		spin_lock(&iip->ili_lock);
>  		iip->ili_last_fields = 0;
>  		iip->ili_fields = 0;
>  		iip->ili_fsync_fields = 0;
> +		spin_unlock(&iip->ili_lock);
>  	}
>  	/*
>  	 * Release the inode's flush lock since we're done with it.
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 4de5070e07655..1234e8cd3726d 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -18,7 +18,8 @@ struct xfs_inode_log_item {
>  	struct xfs_inode	*ili_inode;	   /* inode ptr */
>  	xfs_lsn_t		ili_flush_lsn;	   /* lsn at last flush */
>  	xfs_lsn_t		ili_last_lsn;	   /* lsn at last transaction */
> -	unsigned short		ili_lock_flags;	   /* lock flags */
> +	spinlock_t		ili_lock;	   /* internal state lock */

"internal state lock" is pretty vague to me.  I interpret the comment to
mean that ili_lock protects everything in this structure, when in fact
it doesn't cover ili_fields despite us holding ili_lock for *some* of
the ili_fields updates.

The comment for this lock ought to point out which fields it protects,
since it doesn't cover the entire structure.

--D

> +	unsigned short		ili_lock_flags;	   /* inode lock flags */
>  	unsigned int		ili_last_fields;   /* fields when flushed */
>  	unsigned int		ili_fields;	   /* fields to be logged */
>  	unsigned int		ili_fsync_fields;  /* logged since last fsync */
> -- 
> 2.26.2.761.g0e0b3e54be
> 

  parent reply index

Thread overview: 91+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-22  3:50 [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
2020-05-22  3:50 ` [PATCH 01/24] xfs: remove logged flag from inode log item Dave Chinner
2020-05-22  7:25   ` Christoph Hellwig
2020-05-22 21:13   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 02/24] xfs: add an inode item lock Dave Chinner
2020-05-22  6:45   ` Amir Goldstein
2020-05-22 21:24   ` Darrick J. Wong [this message]
2020-05-23  8:45   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 03/24] xfs: mark inode buffers in cache Dave Chinner
2020-05-22  7:45   ` Amir Goldstein
2020-05-22 21:35   ` Darrick J. Wong
2020-05-24 23:41     ` Dave Chinner
2020-05-23  8:48   ` Christoph Hellwig
2020-05-25  0:06     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 04/24] xfs: mark dquot " Dave Chinner
2020-05-22  7:46   ` Amir Goldstein
2020-05-22 21:38   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 05/24] xfs: mark log recovery buffers for completion Dave Chinner
2020-05-22  7:41   ` Amir Goldstein
2020-05-24 23:54     ` Dave Chinner
2020-05-22 21:41   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 06/24] xfs: call xfs_buf_iodone directly Dave Chinner
2020-05-22  7:56   ` Amir Goldstein
2020-05-22 21:53   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 07/24] xfs: clean up whacky buffer log item list reinit Dave Chinner
2020-05-22 22:01   ` Darrick J. Wong
2020-05-23  8:50   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 08/24] xfs: fold xfs_istale_done into xfs_iflush_done Dave Chinner
2020-05-22 22:10   ` Darrick J. Wong
2020-05-23  9:12   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 09/24] xfs: use direct calls for dquot IO completion Dave Chinner
2020-05-22 22:13   ` Darrick J. Wong
2020-05-23  9:16   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 10/24] xfs: clean up the buffer iodone callback functions Dave Chinner
2020-05-22 22:26   ` Darrick J. Wong
2020-05-25  0:37     ` Dave Chinner
2020-05-23  9:19   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 11/24] xfs: get rid of log item callbacks Dave Chinner
2020-05-22 22:27   ` Darrick J. Wong
2020-05-23  9:19   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 12/24] xfs: pin inode backing buffer to the inode log item Dave Chinner
2020-05-22 22:39   ` Darrick J. Wong
2020-05-23  9:34   ` Christoph Hellwig
2020-05-23 21:43     ` Dave Chinner
2020-05-24  5:31       ` Christoph Hellwig
2020-05-24 23:13         ` Dave Chinner
2020-05-22  3:50 ` [PATCH 13/24] xfs: make inode reclaim almost non-blocking Dave Chinner
2020-05-22 12:19   ` Amir Goldstein
2020-05-22 22:48   ` Darrick J. Wong
2020-05-23 22:29     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
2020-05-22 23:06   ` Darrick J. Wong
2020-05-25  3:49     ` Dave Chinner
2020-05-23  9:40   ` Christoph Hellwig
2020-05-23 22:35     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 15/24] xfs: allow multiple reclaimers per AG Dave Chinner
2020-05-22 23:10   ` Darrick J. Wong
2020-05-23 22:35     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 16/24] xfs: don't block inode reclaim on the ILOCK Dave Chinner
2020-05-22 23:11   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 17/24] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
2020-05-22 23:14   ` Darrick J. Wong
2020-05-23 22:42     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 18/24] xfs: clean up inode reclaim comments Dave Chinner
2020-05-22 23:17   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 19/24] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
2020-05-22 23:48   ` Darrick J. Wong
2020-05-23 22:59     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 20/24] xfs: xfs_iflush() is no longer necessary Dave Chinner
2020-05-22 23:54   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 21/24] xfs: rename xfs_iflush_int() Dave Chinner
2020-05-22 12:33   ` Amir Goldstein
2020-05-22 23:57   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
2020-05-23  0:13   ` Darrick J. Wong
2020-05-23 23:14     ` Dave Chinner
2020-05-23 11:31   ` Christoph Hellwig
2020-05-23 23:23     ` Dave Chinner
2020-05-24  5:32       ` Christoph Hellwig
2020-05-23 11:39   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 23/24] xfs: factor xfs_iflush_done Dave Chinner
2020-05-23  0:20   ` Darrick J. Wong
2020-05-23 11:35   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 24/24] xfs: remove xfs_inobp_check() Dave Chinner
2020-05-23  0:16   ` Darrick J. Wong
2020-05-23 11:36   ` Christoph Hellwig
2020-05-22  4:04 ` [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Dave Chinner
2020-05-23 16:18   ` Darrick J. Wong
2020-05-23 21:22     ` Dave Chinner
2020-05-22  6:18 ` Amir Goldstein
2020-05-22 12:01   ` Amir Goldstein

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200522212436.GF8230@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-XFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-xfs/0 linux-xfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-xfs linux-xfs/ https://lore.kernel.org/linux-xfs \
		linux-xfs@vger.kernel.org
	public-inbox-index linux-xfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-xfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git