From: Amir Goldstein <amir73il@gmail.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH 02/24] xfs: add an inode item lock
Date: Fri, 22 May 2020 09:45:54 +0300
Message-ID: <CAOQ4uxh_gk5SG6dWBHGv6orty0xD017WztpM5iavbCZc-6i_Hg@mail.gmail.com>
In-Reply-To: <20200522035029.3022405-3-david@fromorbit.com>

On Fri, May 22, 2020 at 6:51 AM Dave Chinner <david@fromorbit.com> wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> The inode log item is kind of special in that it can be aggregating
> new changes in memory at the same time as existing changes are
> being written back to disk. This means there are fields in the log
> item that are accessed concurrently from contexts that don't share
> any locking at all.
>
> e.g. ili_last_fields is updated under the ILOCK_EXCL and flush
> lock at flush time, under the flush lock at IO
> completion time, and is read under the ILOCK_EXCL when the inode is
> logged.  Hence there is no actual serialisation between reading the
> field during logging of the inode in transactions vs clearing the
> field in IO completion.
>
> We currently get away with this by the fact that we are only
> clearing fields in IO completion, and nothing bad happens if we
> accidentally log more of the inode than we actually modify. Worst
> case is we consume a tiny bit more memory and log bandwidth.
>
> However, if we want to do more complex state manipulations on the
> log item that require updates at all three of these potential
> locations, we need to have some mechanism of serialising those
> operations. To do this, introduce a spinlock into the log item to
> serialise internal state.
>
> This could be done via the xfs_inode i_flags_lock, but this then
> leads to potential lock inversion issues where inode flag updates
> need to occur inside locks that best nest inside the inode log item
> locks (e.g. marking inodes stale during inode cluster freeing).
> Using a separate spinlock avoids these sorts of problems and
> simplifies future code.
>
> This does not touch the use of ili_fields in the item formatting
> code - that is entirely protected by the ILOCK_EXCL at this point in
> time, so it remains untouched.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c | 44 ++++++++++++++++++---------------
>  fs/xfs/xfs_file.c               |  9 ++++---
>  fs/xfs/xfs_inode.c              | 20 +++++++++------
>  fs/xfs/xfs_inode_item.c         |  7 ++++++
>  fs/xfs/xfs_inode_item.h         |  3 ++-
>  5 files changed, 51 insertions(+), 32 deletions(-)
>
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index b5dfb66548422..510b996008221 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -81,15 +81,19 @@ xfs_trans_ichgtime(
>   */
>  void
>  xfs_trans_log_inode(
> -       xfs_trans_t     *tp,
> -       xfs_inode_t     *ip,
> -       uint            flags)
> +       struct xfs_trans        *tp,
> +       struct xfs_inode        *ip,
> +       uint                    flags)
>  {
> -       struct inode    *inode = VFS_I(ip);
> +       struct xfs_inode_log_item *iip = ip->i_itemp;
> +       struct inode            *inode = VFS_I(ip);
> +       uint                    iversion_flags = 0;
>
>         ASSERT(ip->i_itemp != NULL);
>         ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>
> +       tp->t_flags |= XFS_TRANS_DIRTY;
> +
>         /*
>          * Don't bother with i_lock for the I_DIRTY_TIME check here, as races
>          * don't matter - we either will need an extra transaction in 24 hours
> @@ -102,15 +106,6 @@ xfs_trans_log_inode(
>                 spin_unlock(&inode->i_lock);
>         }
>
> -       /*
> -        * Record the specific change for fdatasync optimisation. This
> -        * allows fdatasync to skip log forces for inodes that are only
> -        * timestamp dirty. We do this before the change count so that
> -        * the core being logged in this case does not impact on fdatasync
> -        * behaviour.
> -        */
> -       ip->i_itemp->ili_fsync_fields |= flags;
> -
>         /*
>          * First time we log the inode in a transaction, bump the inode change
>          * counter if it is configured for this to occur. While we have the
> @@ -120,13 +115,21 @@ xfs_trans_log_inode(
>          * set however, then go ahead and bump the i_version counter
>          * unconditionally.
>          */
> -       if (!test_and_set_bit(XFS_LI_DIRTY, &ip->i_itemp->ili_item.li_flags) &&
> -           IS_I_VERSION(VFS_I(ip))) {
> -               if (inode_maybe_inc_iversion(VFS_I(ip), flags & XFS_ILOG_CORE))
> -                       flags |= XFS_ILOG_CORE;
> +       if (!test_and_set_bit(XFS_LI_DIRTY, &iip->ili_item.li_flags) &&
> +           IS_I_VERSION(inode)) {
> +               if (inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE))
> +                       iversion_flags = XFS_ILOG_CORE;
>         }
>
> -       tp->t_flags |= XFS_TRANS_DIRTY;
> +       /*
> +        * Record the specific change for fdatasync optimisation. This
> +        * allows fdatasync to skip log forces for inodes that are only
> +        * timestamp dirty. We do this before the change count so that
> +        * the core being logged in this case does not impact on fdatasync
> +        * behaviour.
> +        */
> +       spin_lock(&iip->ili_lock);
> +       iip->ili_fsync_fields |= flags;
>
>         /*
>          * Always OR in the bits from the ili_last_fields field.
> @@ -135,8 +138,9 @@ xfs_trans_log_inode(
>          * See the big comment in xfs_iflush() for an explanation of
>          * this coordination mechanism.
>          */
> -       flags |= ip->i_itemp->ili_last_fields;
> -       ip->i_itemp->ili_fields |= flags;
> +       flags |= iip->ili_last_fields | iversion_flags;
> +       iip->ili_fields |= flags;
> +       spin_unlock(&iip->ili_lock);
>  }
>
>  int
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 403c90309a8ff..0abf770b77498 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -94,6 +94,7 @@ xfs_file_fsync(
>  {
>         struct inode            *inode = file->f_mapping->host;
>         struct xfs_inode        *ip = XFS_I(inode);
> +       struct xfs_inode_log_item *iip = ip->i_itemp;
>         struct xfs_mount        *mp = ip->i_mount;
>         int                     error = 0;
>         int                     log_flushed = 0;
> @@ -137,13 +138,15 @@ xfs_file_fsync(
>         xfs_ilock(ip, XFS_ILOCK_SHARED);
>         if (xfs_ipincount(ip)) {
>                 if (!datasync ||
> -                   (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> -                       lsn = ip->i_itemp->ili_last_lsn;
> +                   (iip->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> +                       lsn = iip->ili_last_lsn;
>         }
>
>         if (lsn) {
>                 error = xfs_log_force_lsn(mp, lsn, XFS_LOG_SYNC, &log_flushed);
> -               ip->i_itemp->ili_fsync_fields = 0;
> +               spin_lock(&iip->ili_lock);
> +               iip->ili_fsync_fields = 0;
> +               spin_unlock(&iip->ili_lock);
>         }
>         xfs_iunlock(ip, XFS_ILOCK_SHARED);
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ca9f2688b745d..57781c0dbbec5 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2683,9 +2683,11 @@ xfs_ifree_cluster(
>                                 continue;
>
>                         iip = ip->i_itemp;
> +                       spin_lock(&iip->ili_lock);
>                         iip->ili_last_fields = iip->ili_fields;
>                         iip->ili_fields = 0;
>                         iip->ili_fsync_fields = 0;
> +                       spin_unlock(&iip->ili_lock);
>                         xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>                                                 &iip->ili_item.li_lsn);
>
> @@ -2721,6 +2723,7 @@ xfs_ifree(
>  {
>         int                     error;
>         struct xfs_icluster     xic = { 0 };
> +       struct xfs_inode_log_item *iip = ip->i_itemp;
>
>         ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
>         ASSERT(VFS_I(ip)->i_nlink == 0);
> @@ -2758,7 +2761,9 @@ xfs_ifree(
>         ip->i_df.if_format = XFS_DINODE_FMT_EXTENTS;
>
>         /* Don't attempt to replay owner changes for a deleted inode */
> -       ip->i_itemp->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
> +       spin_lock(&iip->ili_lock);
> +       iip->ili_fields &= ~(XFS_ILOG_AOWNER|XFS_ILOG_DOWNER);
> +       spin_unlock(&iip->ili_lock);
>
>         /*
>          * Bump the generation count so no one will be confused
> @@ -3814,20 +3819,19 @@ xfs_iflush_int(
>          * know that the information those bits represent is permanently on
>          * disk.  As long as the flush completes before the inode is logged
>          * again, then both ili_fields and ili_last_fields will be cleared.
> -        *
> -        * We can play with the ili_fields bits here, because the inode lock
> -        * must be held exclusively in order to set bits there and the flush
> -        * lock protects the ili_last_fields bits.  Store the current LSN of the
> -        * inode so that we can tell whether the item has moved in the AIL from
> -        * xfs_iflush_done().  In order to read the lsn we need the AIL lock,
> -        * because it is a 64 bit value that cannot be read atomically.
>          */
>         error = 0;
>  flush_out:
> +       spin_lock(&iip->ili_lock);
>         iip->ili_last_fields = iip->ili_fields;
>         iip->ili_fields = 0;
>         iip->ili_fsync_fields = 0;
> +       spin_unlock(&iip->ili_lock);
>
> +       /*
> +        * Store the current LSN of the inode so that we can tell whether the
> +        * item has moved in the AIL from xfs_iflush_done().
> +        */
>         xfs_trans_ail_copy_lsn(mp->m_ail, &iip->ili_flush_lsn,
>                                 &iip->ili_item.li_lsn);
>
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index b17384aa8df40..6ef9cbcfc94a7 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -637,6 +637,7 @@ xfs_inode_item_init(
>         iip = ip->i_itemp = kmem_zone_zalloc(xfs_ili_zone, 0);
>
>         iip->ili_inode = ip;
> +       spin_lock_init(&iip->ili_lock);
>         xfs_log_item_init(mp, &iip->ili_item, XFS_LI_INODE,
>                                                 &xfs_inode_item_ops);
>  }
> @@ -738,7 +739,11 @@ xfs_iflush_done(
>         list_for_each_entry_safe(blip, n, &tmp, li_bio_list) {
>                 list_del_init(&blip->li_bio_list);
>                 iip = INODE_ITEM(blip);
> +
> +               spin_lock(&iip->ili_lock);
>                 iip->ili_last_fields = 0;
> +               spin_unlock(&iip->ili_lock);
> +
>                 xfs_ifunlock(iip->ili_inode);
>         }
>         list_del(&tmp);
> @@ -762,9 +767,11 @@ xfs_iflush_abort(
>                  * Clear the inode logging fields so no more flushes are
>                  * attempted.
>                  */
> +               spin_lock(&iip->ili_lock);
>                 iip->ili_last_fields = 0;
>                 iip->ili_fields = 0;
>                 iip->ili_fsync_fields = 0;
> +               spin_unlock(&iip->ili_lock);
>         }
>         /*
>          * Release the inode's flush lock since we're done with it.
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index 4de5070e07655..1234e8cd3726d 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -18,7 +18,8 @@ struct xfs_inode_log_item {
>         struct xfs_inode        *ili_inode;        /* inode ptr */
>         xfs_lsn_t               ili_flush_lsn;     /* lsn at last flush */
>         xfs_lsn_t               ili_last_lsn;      /* lsn at last transaction */
> -       unsigned short          ili_lock_flags;    /* lock flags */
> +       spinlock_t              ili_lock;          /* internal state lock */

"internal state" is a fluid term.
It would be more useful to document "Protects ..." as in i_lock/f_lock.
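
Something along these lines, perhaps. This is only a sketch: the
comment wording is my suggestion, the field list is taken from the
hunks above that wrap ili_last_fields, ili_fields and ili_fsync_fields
in the new lock, and everything named sketch_* is a hypothetical
userspace stand-in (a C11 atomic_flag in place of spinlock_t) so the
fragment compiles outside the kernel. The field types are assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's spinlock_t; an assumption of this sketch. */
typedef struct { atomic_flag held; } sketch_spinlock_t;

static inline void sketch_spin_lock_init(sketch_spinlock_t *l)
{
	atomic_flag_clear(&l->held);
}

static inline void sketch_spin_lock(sketch_spinlock_t *l)
{
	/* Busy-wait until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		;
}

static inline void sketch_spin_unlock(sketch_spinlock_t *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct sketch_inode_log_item {
	/*
	 * ili_lock protects ili_last_fields, ili_fields and
	 * ili_fsync_fields against concurrent updates from transaction
	 * logging, inode flushing and IO completion. Item formatting
	 * still reads ili_fields under the ILOCK_EXCL alone.
	 */
	sketch_spinlock_t	ili_lock;
	unsigned int		ili_last_fields;
	unsigned int		ili_fields;
	unsigned int		ili_fsync_fields;
};

/* Mirrors the flush_out hunk: snapshot and clear the fields under the lock. */
static inline void sketch_flush(struct sketch_inode_log_item *iip)
{
	assert(iip != NULL);

	sketch_spin_lock(&iip->ili_lock);
	iip->ili_last_fields = iip->ili_fields;
	iip->ili_fields = 0;
	iip->ili_fsync_fields = 0;
	sketch_spin_unlock(&iip->ili_lock);
}
```

With a comment like that, a reader can tell which fields need the lock
without grepping for every spin_lock() caller.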

Having verified that the code re-organization leaves the logic unchanged
and that the locking is balanced:

Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks,
Amir.
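
P.S. For the record, the part of xfs_trans_log_inode() that needed the
most care is the reordering around the i_version bump. A minimal model
of the old and new orderings, as I read the hunks above, is sketched
below. The sketch_* names and the SKETCH_ILOG_CORE value are
illustrative assumptions, the "bumped" argument stands in for the
combined test_and_set_bit()/IS_I_VERSION()/inode_maybe_inc_iversion()
condition, and locking is left out because only the arithmetic on the
flag words is being compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_ILOG_CORE 0x1u /* illustrative stand-in for XFS_ILOG_CORE */

struct sketch_state {
	unsigned int fields;       /* models ili_fields */
	unsigned int last_fields;  /* models ili_last_fields */
	unsigned int fsync_fields; /* models ili_fsync_fields */
};

/* Ordering before the patch: record fsync fields, then fold the bump into flags. */
static inline void sketch_log_old(struct sketch_state *s, unsigned int flags,
				  bool bumped)
{
	assert(s != NULL);
	s->fsync_fields |= flags;
	if (bumped)
		flags |= SKETCH_ILOG_CORE;
	flags |= s->last_fields;
	s->fields |= flags;
}

/* Ordering after the patch: remember the bump separately, apply it afterwards. */
static inline void sketch_log_new(struct sketch_state *s, unsigned int flags,
				  bool bumped)
{
	unsigned int iversion_flags = 0;

	assert(s != NULL);
	if (bumped)
		iversion_flags = SKETCH_ILOG_CORE;
	s->fsync_fields |= flags;
	flags |= s->last_fields | iversion_flags;
	s->fields |= flags;
}

/* True when both orderings leave identical state for the given inputs. */
static inline bool sketch_same_result(struct sketch_state start,
				      unsigned int flags, bool bumped)
{
	struct sketch_state a = start, b = start;

	sketch_log_old(&a, flags, bumped);
	sketch_log_new(&b, flags, bumped);
	return a.fields == b.fields &&
	       a.last_fields == b.last_fields &&
	       a.fsync_fields == b.fsync_fields;
}
```

In both orderings the core bit from the i_version bump reaches
ili_fields but not ili_fsync_fields, which is what keeps the fdatasync
optimisation intact.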
