Linux-XFS Archive on lore.kernel.org
 help / color / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous
Date: Fri, 22 May 2020 14:04:01 +1000
Message-ID: <20200522040401.GE2040@dread.disaster.area> (raw)
In-Reply-To: <20200522035029.3022405-1-david@fromorbit.com>


FWIW, I forgot to put it in the original description - the series
can be pulled from my git tree here:

git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git xfs-async-inode-reclaim

Cheers,

Dave.

On Fri, May 22, 2020 at 01:50:05PM +1000, Dave Chinner wrote:
> Hi folks,
> 
> Inode flushing requires that we first lock an inode, then check it,
> then lock the underlying buffer, flush the inode to the buffer and
> finally add the inode to the buffer to be unlocked on IO completion.
> We then walk all the other cached inodes in the buffer range and
> optimistically lock and flush them to the buffer without blocking.
> 
> This cluster write effectively repeats the same code we do with the
> initial inode, except now it has to special case that initial inode
> that is already locked. Hence we have multiple copies of very
> similar code, and it is a result of inode cluster flushing being
> based on a specific inode rather than grabbing the buffer and
> flushing all available inodes to it.
> 
> The problem with this at the moment is that we we can't look up the
> buffer until we have guaranteed that an inode is held exclusively
> and it's not going away while we get the buffer through an imap
> lookup. Hence we are kinda stuck locking an inode before we can look
> up the buffer.
> 
> This is also a result of inodes being detached from the cluster
> buffer except when IO is being done. This has the further problem
> that the cluster buffer can be reclaimed from memory and then the
> inode can be dirtied. At this point cleaning the inode requires a
> read-modify-write cycle on the cluster buffer. If we then are put
> under memory pressure, cleaning that dirty inode to reclaim it
> requires allocating memory for the cluster buffer and this leads to
> all sorts of problems.
> 
> We used synchronous inode writeback in reclaim as a throttle that
> provided a forwards progress mechanism when RMW cycles were required
> to clean inodes. Async writeback of inodes (e.g. via the AIL) would
> immediately exhaust remaining memory reserves trying to allocate
> inode cluster after inode cluster. The synchronous writeback of an
> inode cluster allowed reclaim to release the inode cluster and have
> it freed almost immediately which could then be used to allocate the
> next inode cluster buffer. Hence the IO based throttling mechanism
> largely guaranteed forwards progress in inode reclaim. By removing
> the requirement for require memory allocation for inode writeback
> filesystem level, we can issue writeback asynchrnously and not have
> to worry about the memory exhaustion anymore.
> 
> Another issue is that if we have slow disks, we can build up dirty
> inodes in memory that can then take hours for an operation like
> unmount to flush. A RMW cycle per inode on a slow RAID6 device can
> mean we only clean 50 inodes a second, and when there are hundreds
> of thousands of dirty inodes that need to be cleaned this can take a
> long time. PInning the cluster buffers will greatly speed up inode
> writeback on slow storage systems like this.
> 
> These limitations all stem from the same source: inode writeback is
> inode centric, And they are largely solved by the same architectural
> change: make inode writeback cluster buffer centric.  This series is
> makes that architectural change.
> 
> Firstly, we start by pinning the inode backing buffer in memory
> when an inode is marked dirty (i.e. when it is logged). By tracking
> the number of dirty inodes on a buffer as a counter rather than a
> flag, we avoid the problem of overlapping inode dirtying and buffer
> flushing racing to set/clear the dirty flag. Hence as long as there
> is a dirty inode in memory, the buffer will not be able to be
> reclaimed. We can safely do this inode cluster buffer lookup when we
> dirty an inode as we do not hold the buffer locked - we merely take
> a reference to it and then release it - and hence we don't cause any
> new lock order issues.
> 
> When the inode is finally cleaned, the reference to the buffer can
> be removed from the inode log item and the buffer released. This is
> done from the inode completion callbacks that are attached to the
> buffer when the inode is flushed.
> 
> Pinning the cluster buffer in this way immediately avoids the RMW
> problem in inode writeback and reclaim contexts by moving the memory
> allocation and the blocking buffer read into the transaction context
> that dirties the inode.  This inverts our dirty inode throttling
> mechanism - we now throttle the rate at which we can dirty inodes to
> rate at which we can allocate memory and read inode cluster buffers
> into memory rather than via throttling reclaim to rate at which we
> can clean dirty inodes.
> 
> Hence if we are under memory pressure, we'll block on memory
> allocation when trying to dirty the referenced inode, rather than in
> the memory reclaim path where we are trying to clean unreferenced
> inodes to free memory.  Hence we no longer have to guarantee
> forwards progress in inode reclaim as we aren't doing memory
> allocation, and that means we can remove inode writeback from the
> XFS inode shrinker completely without changing the system tolerance
> for low memory operation.
> 
> Tracking the buffers via the inode log item also allows us to
> completely rework the inode flushing mechanism. While the inode log
> item is in the AIL, it is safe for the AIL to access any member of
> the log item. Hence the AIL push mechanisms can access the buffer
> attached to the inode without first having to lock the inode.
> 
> This means we can essentially lock the buffer directly and then
> call xfs_iflush_cluster() without first going through xfs_iflush()
> to find the buffer. Hence we can remove xfs_iflush() altogether,
> because the two places that call it - the inode item push code and
> inode reclaim - no longer need to flush inodes directly.
> 
> This can be further optimised by attaching the inode to the cluster
> buffer when the inode is dirtied. i.e. when we add the buffer
> reference to the inode log item, we also attach the inode to the
> buffer for IO processing. This leads to the dirty inodes always
> being attached to the buffer and hence we no longer need to add them
> when we flush the inode and remove them when IO completes. Instead
> the inodes are attached when the node log item is dirtied, and
> removed when the inode log item is cleaned.
> 
> With this structure in place, we no longer need to do
> lookups to find the dirty inodes in the cache to attach to the
> buffer in xfs_iflush_cluster() - they are already attached to the
> buffer. Hence when the AIL pushes an inode, we just grab the buffer
> from the log item, and then walk the buffer log item list to lock
> and flush the dirty inodes attached to the buffer.
> 
> This greatly simplifies inode writeback, and removes another memory
> allocation from the inode writeback path (the array used for the
> radix tree gang lookup). And while the radix tree lookups are fast,
> walking the linked list of dirty inodes is faster.
> 
> There is followup work I am doing that uses the inode cluster buffer
> as a replacement in the AIL for tracking dirty inodes. This part of
> the series is not ready yet as it has some intricate locking
> requirements. That is an optimisation, so I've left that out because
> solving the inode reclaim blocking problems is the important part of
> this work.
> 
> In short, this series simplifies inode writeback and fixes the long
> standing inode reclaim blocking issues without requiring any changes
> to the memory reclaim infrastructure.
> 
> Note: dquots should probably be converted to cluster flushing in a
> similar way, as they have many of the same issues as inode flushing.
> 
> Thoughts, comments and improvemnts welcome.
> 
> -Dave.
> 
> 
> 
> Dave Chinner (24):
>   xfs: remove logged flag from inode log item
>   xfs: add an inode item lock
>   xfs: mark inode buffers in cache
>   xfs: mark dquot buffers in cache
>   xfs: mark log recovery buffers for completion
>   xfs: call xfs_buf_iodone directly
>   xfs: clean up whacky buffer log item list reinit
>   xfs: fold xfs_istale_done into xfs_iflush_done
>   xfs: use direct calls for dquot IO completion
>   xfs: clean up the buffer iodone callback functions
>   xfs: get rid of log item callbacks
>   xfs: pin inode backing buffer to the inode log item
>   xfs: make inode reclaim almost non-blocking
>   xfs: remove IO submission from xfs_reclaim_inode()
>   xfs: allow multiple reclaimers per AG
>   xfs: don't block inode reclaim on the ILOCK
>   xfs: remove SYNC_TRYLOCK from inode reclaim
>   xfs: clean up inode reclaim comments
>   xfs: attach inodes to the cluster buffer when dirtied
>   xfs: xfs_iflush() is no longer necessary
>   xfs: rename xfs_iflush_int()
>   xfs: rework xfs_iflush_cluster() dirty inode iteration
>   xfs: factor xfs_iflush_done
>   xfs: remove xfs_inobp_check()
> 
>  fs/xfs/libxfs/xfs_inode_buf.c   |  27 +-
>  fs/xfs/libxfs/xfs_inode_buf.h   |   6 -
>  fs/xfs/libxfs/xfs_trans_inode.c | 108 +++++--
>  fs/xfs/xfs_buf.c                |  44 ++-
>  fs/xfs/xfs_buf.h                |  49 +--
>  fs/xfs/xfs_buf_item.c           | 205 +++++--------
>  fs/xfs/xfs_buf_item.h           |   8 +-
>  fs/xfs/xfs_buf_item_recover.c   |   5 +-
>  fs/xfs/xfs_dquot.c              |  32 +-
>  fs/xfs/xfs_dquot.h              |   1 +
>  fs/xfs/xfs_dquot_item_recover.c |   2 +-
>  fs/xfs/xfs_file.c               |   9 +-
>  fs/xfs/xfs_icache.c             | 293 +++++-------------
>  fs/xfs/xfs_inode.c              | 515 +++++++++++---------------------
>  fs/xfs/xfs_inode.h              |   2 +-
>  fs/xfs/xfs_inode_item.c         | 281 ++++++++---------
>  fs/xfs/xfs_inode_item.h         |   9 +-
>  fs/xfs/xfs_inode_item_recover.c |   2 +-
>  fs/xfs/xfs_log_recover.c        |   5 +-
>  fs/xfs/xfs_mount.c              |   4 -
>  fs/xfs/xfs_mount.h              |   1 -
>  fs/xfs/xfs_trans.h              |   3 -
>  fs/xfs/xfs_trans_buf.c          |  15 +-
>  fs/xfs/xfs_trans_priv.h         |  12 +-
>  24 files changed, 680 insertions(+), 958 deletions(-)
> 
> -- 
> 2.26.2.761.g0e0b3e54be
> 
> 

-- 
Dave Chinner
david@fromorbit.com

  parent reply index

Thread overview: 91+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-22  3:50 Dave Chinner
2020-05-22  3:50 ` [PATCH 01/24] xfs: remove logged flag from inode log item Dave Chinner
2020-05-22  7:25   ` Christoph Hellwig
2020-05-22 21:13   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 02/24] xfs: add an inode item lock Dave Chinner
2020-05-22  6:45   ` Amir Goldstein
2020-05-22 21:24   ` Darrick J. Wong
2020-05-23  8:45   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 03/24] xfs: mark inode buffers in cache Dave Chinner
2020-05-22  7:45   ` Amir Goldstein
2020-05-22 21:35   ` Darrick J. Wong
2020-05-24 23:41     ` Dave Chinner
2020-05-23  8:48   ` Christoph Hellwig
2020-05-25  0:06     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 04/24] xfs: mark dquot " Dave Chinner
2020-05-22  7:46   ` Amir Goldstein
2020-05-22 21:38   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 05/24] xfs: mark log recovery buffers for completion Dave Chinner
2020-05-22  7:41   ` Amir Goldstein
2020-05-24 23:54     ` Dave Chinner
2020-05-22 21:41   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 06/24] xfs: call xfs_buf_iodone directly Dave Chinner
2020-05-22  7:56   ` Amir Goldstein
2020-05-22 21:53   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 07/24] xfs: clean up whacky buffer log item list reinit Dave Chinner
2020-05-22 22:01   ` Darrick J. Wong
2020-05-23  8:50   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 08/24] xfs: fold xfs_istale_done into xfs_iflush_done Dave Chinner
2020-05-22 22:10   ` Darrick J. Wong
2020-05-23  9:12   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 09/24] xfs: use direct calls for dquot IO completion Dave Chinner
2020-05-22 22:13   ` Darrick J. Wong
2020-05-23  9:16   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 10/24] xfs: clean up the buffer iodone callback functions Dave Chinner
2020-05-22 22:26   ` Darrick J. Wong
2020-05-25  0:37     ` Dave Chinner
2020-05-23  9:19   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 11/24] xfs: get rid of log item callbacks Dave Chinner
2020-05-22 22:27   ` Darrick J. Wong
2020-05-23  9:19   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 12/24] xfs: pin inode backing buffer to the inode log item Dave Chinner
2020-05-22 22:39   ` Darrick J. Wong
2020-05-23  9:34   ` Christoph Hellwig
2020-05-23 21:43     ` Dave Chinner
2020-05-24  5:31       ` Christoph Hellwig
2020-05-24 23:13         ` Dave Chinner
2020-05-22  3:50 ` [PATCH 13/24] xfs: make inode reclaim almost non-blocking Dave Chinner
2020-05-22 12:19   ` Amir Goldstein
2020-05-22 22:48   ` Darrick J. Wong
2020-05-23 22:29     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 14/24] xfs: remove IO submission from xfs_reclaim_inode() Dave Chinner
2020-05-22 23:06   ` Darrick J. Wong
2020-05-25  3:49     ` Dave Chinner
2020-05-23  9:40   ` Christoph Hellwig
2020-05-23 22:35     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 15/24] xfs: allow multiple reclaimers per AG Dave Chinner
2020-05-22 23:10   ` Darrick J. Wong
2020-05-23 22:35     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 16/24] xfs: don't block inode reclaim on the ILOCK Dave Chinner
2020-05-22 23:11   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 17/24] xfs: remove SYNC_TRYLOCK from inode reclaim Dave Chinner
2020-05-22 23:14   ` Darrick J. Wong
2020-05-23 22:42     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 18/24] xfs: clean up inode reclaim comments Dave Chinner
2020-05-22 23:17   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 19/24] xfs: attach inodes to the cluster buffer when dirtied Dave Chinner
2020-05-22 23:48   ` Darrick J. Wong
2020-05-23 22:59     ` Dave Chinner
2020-05-22  3:50 ` [PATCH 20/24] xfs: xfs_iflush() is no longer necessary Dave Chinner
2020-05-22 23:54   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 21/24] xfs: rename xfs_iflush_int() Dave Chinner
2020-05-22 12:33   ` Amir Goldstein
2020-05-22 23:57   ` Darrick J. Wong
2020-05-22  3:50 ` [PATCH 22/24] xfs: rework xfs_iflush_cluster() dirty inode iteration Dave Chinner
2020-05-23  0:13   ` Darrick J. Wong
2020-05-23 23:14     ` Dave Chinner
2020-05-23 11:31   ` Christoph Hellwig
2020-05-23 23:23     ` Dave Chinner
2020-05-24  5:32       ` Christoph Hellwig
2020-05-23 11:39   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 23/24] xfs: factor xfs_iflush_done Dave Chinner
2020-05-23  0:20   ` Darrick J. Wong
2020-05-23 11:35   ` Christoph Hellwig
2020-05-22  3:50 ` [PATCH 24/24] xfs: remove xfs_inobp_check() Dave Chinner
2020-05-23  0:16   ` Darrick J. Wong
2020-05-23 11:36   ` Christoph Hellwig
2020-05-22  4:04 ` Dave Chinner [this message]
2020-05-23 16:18   ` [PATCH 00/24] xfs: rework inode flushing to make inode reclaim fully asynchronous Darrick J. Wong
2020-05-23 21:22     ` Dave Chinner
2020-05-22  6:18 ` Amir Goldstein
2020-05-22 12:01   ` Amir Goldstein

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200522040401.GE2040@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-XFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-xfs/0 linux-xfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-xfs linux-xfs/ https://lore.kernel.org/linux-xfs \
		linux-xfs@vger.kernel.org
	public-inbox-index linux-xfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-xfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git